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Algorithms for large data sets, unlike in-core algorithms, have to keep the bulk of their 
data in the secondary memory, which typically is much slower than the main memory. 
In designing these out-of-core algorithms, the goal is therefore to minimise the number of 
I/Os executed. The literature is rich in efficient out-of-core algorithms for matrix compu- 
tation. But very few of them are designed on the external memory model of Aggarwal and 
Vitter, and as such attempt to quantify their performances in terms of the number of I/Os 
performed. This thesis makes some contributions in that direction. 

We analyse some QR decomposition algorithms, and show that the I/O complexity of 
the tile based algorithm is asymptotically the same as that of matrix multiplication. This 
algorithm, we show, performs the best when the tile size is chosen so that exactly one tile 
fits in the main memory. We propose a constant factor improvement, as well as a new 
recursive cache oblivious algorithm with the same asymptotic I/O complexity. 

The traditional unblocked and blocked Hessenberg, tridiagonal, and bidiagonal reduc- 
tions are not I/O efficient because vector- matrix operations dominate their performances. 
We design Hessenberg, tridiagonal, and bidiagonal reductions that use banded intermedi- 
ate forms, and perform only asymptotically optimal numbers of I/Os; these are the first 
I/O optimal algorithms for these problems. 

In particular, we show that known slab based algorithms for two sided reductions all 
have suboptimal asymptotic I/O performances, even though they have been reported to 
do better than the traditional algorithms on the basis of empirical evidence. 

We propose new tile based variants of multishift QR and QZ algorithms that under 
certain conditions on the number of shifts, have better seek and I/O complexities than all 
known variants. 

We show that techniques like rescheduling of computational steps, appropriate choosing 

of the blocking parameters and incorporating of more matrix-matrix operations, can be 
used to improve the I/O and seek complexities of matrix computations. 
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Chapter 1 
Introduction 



Computing applications that deal with massive data sets abound today. Examples can be 
found in databases, geographic information systems, VLSI verification, computerised medi- 
cal treatment, astrophysics, geophysics, multimedia, computer graphics, virtual reality, 3D 
simulation and modelling, sensors, web applications, network traffic analysis, constraint 
logic programming, computational biology, to name a few. 

Traditional "in-core" algorithms assume that the main memory is infinite in size and 
allows random uniform access to all its locations, and therefore, that all their data fit in the 
main memory. Algorithms for large data sets (often called out-of-core (OOC) or external 
memory algorithms) cannot assume this. In reality, the main memory is limited; so, these 
algorithms have to keep the bulk of their data in the secondary memory, which typically 
is much slower than the main memory. 

The usual performance metric for in-core algorithms is the number of instructions ex- 
ecuted. However, for OOC algorithms, the number of inputs/outputs (I/Os) executed 
would be a more appropriate performance metric. So, in OOC algorithm design, the goal 
is to minimise the number of I/Os. OOC algorithms are therefore sometimes called I/O 
efficient algorithms. 

In the last few decades, computers have become a lot faster, and the amount of main 
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memory they have has grown. But the issue of the main memory being hmited has only 
become more relevant, because applications have grown even faster in size. Also, small 
computing devices (e.g., sensors, smart phones) with limited memories have found several 
uses. 

1.1 The External Memory Model 

Designing of I/O efficient algorithms has, therefore, been an actively researched area in 
the last few years. For many a problem, it has been shown that the traditionally efficient 
algorithm is not very I/O efficient, and that novel and very different design techniques can 
be used to produce a much more I/O efficient algorithm. The external memory model of 
Aggarwal and Vitter [1 11107] has been used to design many of these algorithms. This model 
has a single processor and a two level memory. It is assumed that the bulk of the data is 
kept in the secondary memory (disk) which is a permanent storage. The secondary memory 
is divided into blocks. An I/O is defined as the transfer of a block of data between the 
secondary memory and a volatile main memory, which is limited in size. The processor's 
clock period and main memory access time are negligible when compared to secondary 
memory access time. The measure of performance of an algorithm is the number of I/Os it 
performs. The model defines the following parameters: the size of the main memory (M), 
and the size of a disk block {B < M). (The two levels might as well be the cache and the 
main memory.) 

We add another performance metric, namely, the number of seeks, to the model. A seek 
is the task of positioning the read/write head of the disk to the required data block. In 
reality, the cost of seeking a block will depend on the location of the block on the disk. We 
make the simplistic assumption that every block can be sought at the same cost. We also 
assume that, irrespective of the size of the data block, all contiguous data can be accessed 
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using a single seek. We define the seek complexity of an algorithm as the number of times 
the algorithm leaves one data locality for another, (or, in other words, chooses to read or 
write a block that is not physically contiguous with the last block read or written) thereby 
necessitating a seek. 

Suppose that an algorithm performs k seeks, and that after the i-th seek it moves Sj 
data items between the disk and the main memory. Let rii = \^~\B, and s = Yli=i 
n = Elin.. Then f = Eli f < EIJH < Eli(| + 1) = i + k- That is, the number 
of I/Os performed by the algorithm is at most the number of blocks of data items moved 
plus the number of seeks. 

Throughout this thesis, unless stated otherwise, we assume that M > B^. This is a 
realistic assumption to make, and is called the "tall cache assumption" |5l] . 

Algorithms that are oblivious of the memory hierarchy, and nevertheless use the hi- 
erarchy efficiently have received particular attention because they are portable [51]. A 
hierarchy oblivious algorithm leaves the page replacement decisions to the operating sys- 
tem (and still executes efficiently), whereas a hierarchy aware algorithm is assumed to 
have complete control over the secondary memory. Hierarchy oblivious algorithms are 
often called "cache oblivious" in literature. 

1.2 OOC Algorithms for Matrix Computations 

Early work on external memory algorithms were largely concentrated on fundamental prob- 
lems like sorting, permutating, graph problems, computational geometry and string match- 
ing problems [H[9|[29 | [30 | H0 | [59 | [M |ll06iri08] . External memory algorithms for fundamen- 
tal matrix operations like matrix multiplication and matrix transposition were proposed 
in IHllQSj. Not many linear algebra algorithms have been designed or analysed explicitly 
on the external memory model. 
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Software libraries that implement various linear algebra operations have been in devel- 
opment over many years. BLAS [I5lllllll2] provides kernel routines to implement a variety 
of fundamental vector and matrix operations. The software library LAPACK [7J, based on 
BLAS routines and designed for sequential computers, implements a variety of advanced 
linear algebra routines. LAPACK does not provide explicit OOC capabilities, and has to 
rely on virtual memory systems to perform out-of-core computations. For larger problems, 
an extension of LAPACK called ScaLAPACK [31] is available. It is designed for distributed 
memory parallel architectures, and includes prototype OOC implementations of some liner 
system solvers. SOLAR [M] further extends LAPACK and ScaLAPACK, by adding OOC 
capabilities to them in a portable fashion. SOLAR routines use ScaLAPACK routines for 
in-core computations and manage matrix inputs and outputs using an I/O layer. PLA- 
PACK [51 1103] , an extension of LAPACK for parallel computations, provides for coding 
parallel linear algebra algorithms at a high level of abstraction. POOCLAPACK [93] is an 
OOC extension of PLAPACK. It provides parallel implementations of OOC linear algebra 
operations. It uses PLAPACK routines for in-core parallel computations. 

Even apart from the above, the literature is rich in efficient out-of-core algorithms. For 
example, dense-matrix factorizations like LU-factorization, Cholesky factorization, QR- 
factorization, solving of systems of linear equations, standard and generalized eigenvalue 
problems have been investigated from the out-of-core perspective. But very few of them are 
designed on the external memory model, and as such attempt to quantify the performance 
in terms of the number of I/Os performed. Without such analyses, it is often difficult to 
say for sure which algorithm will do better under a given circumstance. This thesis makes 
some contributions in that direction. 

See |100] for a survey on out-of-core algorithms in linear algebra. Also see [50] for 
a survey on recursive blocked algorithms and hybrid data structures for problems like 
Cholesky and QR decompositions, and solving of general linear systems. 
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Most of the computations involving vector-vector (V-V) or vector-matrix (V-M) op- 
erations, perform 0{N'^/B) I/Os while executing 0{N'^) flops thus giving the so called 
surface-to-surface effect to the ratio of flops to I/Os [5811115] . It has been observed that 
casting computations in terms of matrix-matrix (M-M) operations improves performance 
by reducing I/Os [23ll44[l58[lll5] . The reason is that M-M operations perform only 0{N'^/B) 
I/Os while executing 0{N^) flops giving the so called surface-to-volume effect to the ratio 
of flops to I/Os [44J. 

Algorithms can be recast to be rich in M-M operations using either the slab or the tile 
approach. In the former, the matrix is partitioned into vertical slabs, which are processed 
one after the other; between two slabs the whole of the matrix may have to be updated using 
an aggregate of transformations generated from the processing of the slab; this updation 
could be done using M-M operations. Slab based algorithms for Cholesky, LU and QR 
decompositions that use aggregations of Gauss elimination or Householder transformations 
have been reported [HlllHlES]. In the tile approach, the matrix is typically partitioned 
into square tiles. It has been observed that the tile approach results in more scalable 
out-of-core algorithms for Cholesky [631I931IT0T] and QR decompositions [G^, and in more 
granular and asynchronous parallel algorithms on multicore architectures for LU, Cholesky, 
and QR decompositions |28] . 

1.3 Contributions of this Thesis 

In this thesis, we present I/O efficient algorithms for some matrix problems. We also 
analyse some known algorithms on the external memory model. 
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QR Decomposition 



For every matrix A G M 



NxP 



, there exist an orthogonal matrix Q G M 



NxN 



and upper 



triangular matrix R G M^^^ such that A = QR. The decomposition of A into Q and R is 
called a QR-decomposition of A. 

QR-decomposition is useful in solving eigenvalue problems, systems of linear equations 
and linear least-square problems. We investigate the various out-of-core QR-decomposition 
algorithms in Chapter 2. 

A QR decomposition of a matrix can be obtained using Givens rotations, orthogonali- 
sation via Gram-Schmidt and Modified Gram-Schmidt methods, or Householder transfor- 
mations [58 |lll5j . Stability of the system, dimensions of the matrix, architecture of the 
machine and how the factorization is to be subsequently used are some of the factors that 
influence the choice 164\ . 



We focus on Householder based methods for QR-decomposition, because these exploit 
locality of reference better than rotation based algorithms [211 [781 [79] and therefore are 
more amenable to OOC computations. 

The traditional QR decomposition based on Householder transformations is rich in 
matrix- vector, and not matrix-matrix operations. Efficient parallel and OOC computations 
require algorithms rich in the latter. Exploiting data locality to recast algorithms in terms 
of matrix- matrix multiplication has received much attention [71[25 1 I55 1 HT | I ^ [ ^I1U1] . For 
QR decomposition using Householder transformations, slab based [71 [231 [HI [96] tile 
based [281 [61] algorithms have been suggested. A slab based algorithm reads a slab of 
columns into the main memory at a time. A tile based algorithm partitions the matrix 
into a number of square tiles, and processes one tile at a time. 

We analyse some QR decomposition algorithms for their I/O complexity. We show 
that the traditional algorithm based on Householder transformations |115j has an I/O 
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complexity of 0(A^Pmin{A^, P}/B). 

The slab based algorithm of |96] is analysed. We choose this algorithm because of 
its simplicity. (There have been slab based algorithms that build on this. Recursive 
in-core algorithms that partition the matrix vertically are given in [l8lll9]. An OOC 
implementation of these is given in ^1].) We consider an implementation of the algorithm 
where matrix multiplication is done using the blocked algorithm of [58]. We find that the 
I/O cost of this implementation is of the same order as that of matrix multiplication for 
most choices of the problem parameters, and that the optimal choice for slab width is not 
necessarily the number of columns that can be fit in the main memory. 

We also analyse the tile based algorithm of jM], and show that its I/O complexity is of 
the same order as that of matrix multiplication. We also suggest a new tile based algorithm 
that improves the I/O complexity by an 0(1) factor. 

We also present a new cache-oblivious QR-decomposition algorithm, and its analysis. 
This algorithm is useful when Q needs to be reported explicitly. Its I/O complexity is of 
the same order as that of the tile based algorithm. 

Banded Hessenberg reduction and related problems 

Computing of eigenvalues and singular values of matrices have wide-ranging applications 
in engineering and computational sciences. 

The eigenvalues of a square matrix are typically computed in a two stage process. In 
the first stage the matrix is reduced, through a sequence of orthogonal similarity transfor- 
mations (OSTs), to a condensed form (Hessenberg form, if the matrix is nonsymmetric, 
and tridiagonal form, otherwise), and in the second stage an iterative method called the 
QR algorithm is applied to the condensed form [58 (1115] . The reduction of a nonsymmetric 
N X N matrix to upper Hessenberg form as well as a symmetric N x N matrix to tridiag- 
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onal form using OSTs takes 0{N^) flops. Each iteration of the QR algorithm takes 0{N'^) 
and 0{N) flops respectively, when applied on Hessenberg and tridiagonal matrices. 

The singular values of a matrix are also typically computed in a similar two stage process. 
First the matrix is reduced to bidiagonal form using orthogonal equivalence transformations 
(OETs); this takes 0{N^) flops. Then the modified QR algorithm |57| is apphed to the 
bidiagonal matrix. Each iteration of the QR algorithm takes 0{N) flops. 

(An OST on a square matrix A G M^^^ transforms A into Q^AQ using an orthogonal 
matrix Q. An OET on a matrix A G M^^^ transforms A into Q^AP using orthogonal 
matrices Q and P. OSTs and OETs preserve eigenvalues.) 

Thus, in each case, the first stage of the computation, which costs 0{N^) flops, forms 
a bottleneck. Each of these algorithms also needs 0{N^/B) I/Os to perform its 0{N^) 
operations. 

The traditional Hessenberg, tridiagonal, and bidiagonal reduction algorithms based on 
Householder transformations (Householders) or rotators are rich in V-V and V-M oper- 
ations, and not in M-M operations [5811115] . Householder based sequential and parallel 
algorithms for Hessenberg, tridiagonal and bidiagonal reductions using the slab approach 
have been proposed |l5lll6l|89]. But even in these, V-M operations dominate the perfor- 
mance: after the reduction of each column of a slab, the rest of the matrix needs to be read 
in to update the next column before it will be ready to be reduced. This is because of the 
two sidedness of the transformations involved. This also makes it difficult to design tile 
based versions of these algorithms. To the best of our knowledge, no tile based algorithm 
has been proposed for the above direct reductions. 

Due to the above reasons, it has been proposed that the reduction in the first stage be 
split into two steps [IZlliniElllTTllHn], the first reducing the matrix to a block condensed 
form (banded Hessenberg, symmetric banded or banded, as is relevant) [TT1[6T1[8T], and 
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the second further reducing it to Hessenberg, tridiagonal or bidiagonal form [2T | [77 t[78l[80] . 
Almost all operations of the first step are performed using M-M operations. Though the 
modified first stage takes more flops, its richness in M-M operations makes it more efficient 
on machines with multiple levels of memory. 

The first steps of these reductions, namely the banded Hessenberg and related reduc- 
tions, are discussed in Chapter 3. 

Reduction of a nonsymmetric matrix to banded Hessenberg form We study the 
slab based algorithm of ^58j, and analyse it for its I/O complexity for varying values of M 
and A^, for a given value of t, the desired bandwidth. This algorithm assumes that the slab 
size k = t. We generalise this into an algorithm with k not necessarily the same as t. We 
find that the algorithm performs the best when k = min{t, \/M}. 

We also investigate the I/O complexity of some tile based algorithms pT t lM t lQO]. 

These algorithms, typically choose t as the tile size, and also at any time, maintain only a 
small number of tiles in the main memory. If t ^ VM, then most of the main memory stays 
unused. We propose an algorithm that allows the number of tiles kept in the main memory 
to depend on the tile size. We also propose an algorithm for t > v^M, which partitions the 
matrix into tiles of size V^M/S; this gives a constant factor improvement over the choice 
of t as the tile size. We conclude that banded Hessenberg reduction can be achieved in 
0{N{N - tf/Bmm{t, a/M}) I/Os using the tile approach, when {N - t) = e{N). 

Reduction of a symmetric matrix to symmetric banded form We recall the stan- 
dard slab based algorithm for the problem, and draw conclusions similar to the above. Our 
tile based algorithms for banded Hessenberg reduction work for symmetric band reduction 
too, and perform about half as many I/Os. 
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Reduction to banded form We describe some known slab and tile based algorithms 
for the problem and analyse them for their I/O complexities, as in the cases of banded 
Hessenberg and symmetric band reductions. 

Reduction from banded Hessenberg form to Hessenberg form, and 
related problems 

These are studied in Chapter 4. 

We recall an unblocked and a blocked direct Hessenberg reduction algorithm, and 
show that they both have an I/O complexity of 0{N'^ /B). For the banded Hessenberg 
to Hessenberg reduction we propose two algorithms. The first one requires 0{N^/B) 
I/Os, but the second one, an extension of the first using tight packing of bulges requires 
only O ^^-^ + I/Os when t, the bandwidth, is at most a/M. Combining this with 
the results of Chapter 3, we show that the Hessenberg reduction can be performed in 
0{N^ /\/MB) I/Os, when is sufficiently large. This matches the known lower bound on 
the data movement [13] . 

We recall unblocked and blocked tridiagonal reductions, and show that they have an I/O 
complexity of 0{N^/B). For the symmetric banded form to tridiagonal form reduction, 
we show, the known algorithms take 0{N'^t/B) I/Os. Combining this with the results 
of Chapter 3, we show that the tridiagonal reduction can be performed in 0{N^/\/MB) 
I/Os, when is sufficiently large. This matches the known lower bound on the data 
movement [T3] . 

We recall unblocked and a blocked bidiagonal reductions, and show that they have an 
I/O complexity of 0{NP mm{N, P}/ B), when applied on an x P matrix. For the 
banded to bidiagonal reduction, we show, the known algorithms take 0{mm{N, P}H)/ B) 
I/Os, when t is the bandwidth. Combining this with the results of Chapter 3, we show 
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that the bidiagonal reduction can be performed in O y — j I/Os when the matrix 

is sufficiently large. This matches the known lower bound on data movement [13] . 

The generalised eigenvalue problem 

Chapter 5 deals with the generalized eigenvalue problem. We analyze both the unblocked 
and blocked algorithms for Hessenberg- Triangular reduction using Givens rotations [TCTjlSB] 
and show that their I/O complexity is 0{N^/B). We then analyse the various known tile 
based algorithms [STlIZn] for reducing a matrix pair to banded Hessenberg- Triangular pair 
with bandwidth t. Then we study the reduction of a banded Hessenberg- Triangular pair 
to Hessenberg- Triangular pair [371[70], and obtain new algorithms analogous to those in 
Chapter 4. Then we put the two steps together, and find, for given values of M and N, 
the band width t that will minimise the I/O cost of the two step reduction. For sufficiently 
large matrices, our reduction matches the known lower bound on the data movement [13] . 

The QR and QZ algorithms 

The QR algorithm solves the standard eigenvalue problem. Analogously, the QZ-algorithm 
solves the generalised eigenvalue problem. Both are iterative algorithms. Chapter 6 deals 
with both of them. We analyse the multishift QR algorithm of [I2] ; this algorithm chases 
mxm bulges, where m is the number of shifts, using both matrix- vector and matrix-matrix 
operations. We also investigate the small-bulge multishift QR algorithm [25] which was 
proposed to avoid the phenomenon called shift blurring. We propose a tile based algorithm 
that under certain conditions of the number of shifts has better seek and I/O complexities. 
We prove analogous results for the QZ algorithm [69j too. 
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1.4 Preliminaries 

Most of the algorithms in this thesis, invoke matrix transposition and multiphcation. 
Therefore, we analyse the well-known cache-aware algorithms for these problems here 
[68|[Tn7tfTn8]. 

A matrix is said to be in row (column) major order if its rows (columns) appear con- 
tiguously on the disk. 

1.4.1 Matrix Transposition 

The standard memory-hierarchy aware matrix transposition algorithm divides the input 
and output matrices into tiles so that two tiles fit in the main memory together, and 
transposes the input tiles one at a time. Formally, on a. P x Q matrix E: 

Algorithm 1.1. Matrix Transposition 
Input: Matrices E eM.^''^ and F eR'^''^ 

Output: F = E^ 

Tet s = ^/M/2; 

for i = 1 to \P/s~\ do 

for j = 1 to \Q/s] do 
F ■ = E^ - 

endfor 
endfor 

Assume that B < P,Q, Also assume that E and F are in row-major order. The 

data movement amounts to 2PQ. Therefore, the I/O complexity T < + S where S is 
the number of seeks. 

We now estimate the number of seeks. 

• P,Q < a/M/2. The two matrices fit in the main memory together. Read E in, 
compute F and write F. If a seek is needed between two rows, then S = P + Q. If 
rows follow one another contiguously, then S = 2. 
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• P < a/M/2 and Q > yjMjl. Divide E and F into slabs of width s each, E 
vertically and F horizontally, so that a slab each of E and F will fit in the main 
memory together. Let q — [Q/sJs and Xq — Q mod s. If a seek is needed between 
two rows, then 5 = f(P + s) + P + Xg = ^ + P + (5. 

If rows follow one another contiguously, then 5" = 2(^P + 1) + P+1= ^ + P + g/s+l 

• g < and P > Similar to the case above. Interchange the roles of P 
and Q. 

• P,Q > ■\jM/2. Divide E and F into tiles of width s each, so that a tile each of E and 
F will fit in the main memory together. Let q — [Q/sJs, Xq — Q mod s, p — [P/s\s 
and Xp = P mod s. In this case, it does not matter whether the rows follow one 
another contiguously. 5" = f | (2s) + | (s + Xp) + f (s + Xq) +Xq + Xp^ + p + g. 

If B divides s, and s divides both P and Q then, the seeks will not cause additional 
1/Os, and the I/O complexity would be exactly 

When P^Q^N,T <^ + 2N. 

Table 1.1: The seek complexity of transposing & P x Q matrix E into a Q x P matrix F, 
when both E and F are in row major order. If a seek is needed between two consecutive 
rows the seeks are as in column #1. Otherwise, the seeks are as in column #2. 



Seeks for E, F in RMO 


#1 


#2 


P,Q<^yM/2 


P + Q 


2 


P < y/M/2 and Q > y/M/2 


^ + P + Q 


^ + P+2 + l 
s s 


Q < y/M/2 and P > y/M/2 


^ + Q + P 


^ + Q + f + 1 


P,Q> VM/2 


Pg+Qp + P + Q 


+ p + Q 
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1.4.2 Matrix Multiplication 

Here the input and output matrices are divided into tiles so that three tiles fit in the main 

memory together |100[ll07|ll08j . Formally, on a P x Q matrix E, and Q x R matrix F: 

Algorithm 1.2. Matrix Multiplication 

Input: Matrices E G R^'''^ , F G M'^xi? q ^ ^pxR 

Output: G = E*F 

Let s = ^JWfl; 
for i = 1 to \P/s~\ do 
for j = 1 to \R/s~\ do 
Gij = 0; 

for k = 1 to \Q/s~\ do 

Gij+ = Eik * Fkj; 
endfor 
endfor 
endfor 



Assume that B < P,Q, R, a/M/3. (Also, assume that E, F and G are in row-major order; 
while the elements of a row are contiguous, a seek may be necessary between rows.) Let 
q = ilQ/s] - l)s, Xq = Q - sq, r = {\R/s] - l)s, Xr = R - sr, p = {\P/s] - l)s 
and Xp = P — sp. Let D denote the amount of data moved, S the number of seeks, and 
T = D -\- S the I/O complexity. 

• P,Q,R < a/ M/3. All three matrices fit in the main memory together. Read E 
and F in, compute G and write G. The data moved is -D = PQ + QR + PR. The 
number of seeks S = 2P + Q or 3, depending on whether a seek is needed between 
two consecutive rows or not. T < pQ+Q^+p^ j^2P + Q. 

• P,Q < v^M/3 and R > x^ = P. Divide F and G into tiles of width 
s each, so that a tile each of F and G will fit in the main memory along with 
E. F is read in and G is written out tile by tile. We have D = PQ + QR + PR. 
S = p+(i + i)(p + g) = 2P + Q + ^^ otS = l + (^ + l)(P+g) = 2P + Q + ^^, 

14 



CHAPTER 1 



Introduction 



depending on whether a seek is needed between two consecutive rows or not. Thus, 

rjn ^ PQ+QR+PR j^2P + Q -\- 

• P,R < and Q > Divide E and F into tiles of width and height 
respectively of s each, so that a tile each of E and F will fit in the main memory 
along with G. Read E and F tile by tile, and write G. We have D = PQ + QR+PR. 

S = P+{!l + l){P + s) = 2P + q + s + ^. S^l + {i + l){P + s) = 2P + q + s + ^. 
Thus, T < PQ+Q^R+RP + P1 + 2P + Q + S 

• Q,R < a/ M/3 and P > a/ M/3. Divide and G into tiles of height s each, so that a 
tile each of E and G will fit in the main memory along with F. Read F. Read E and 
write G tile by tile. We have D = PQ+QR+PR. S = Q+(f+l)(s+s) = 2p+2s+Q 
or S = 2p + 2s + 1, depending on whether a seek is needed between two consecutive 
rows or not. We have T < pQ+Q^+^p + 2p + 2s + Q. 

• P,Q> and R < y/M/3. Divide E into tiles of size sxs, F and G into tiles of 
height s each, so that a tile each olE, F and G will fit in the main memory. We have 
D = %sQ + QR) + {xpQ + QR) + PR=^ + PQ + QR + PR. ,5 = f (s + (f + + 
Q) + Xp+{^^ + l)xp + Q = 2P + Q + 22±f2 . If the rows follow each other contiguously, 
then g in the above becomes + Thus, T < p9R^pM±p9+QR^2P + Q + ^Q^. 

• Q,R > y/M/3 and P < a/m/S. Divide F into tiles of size s x s, E and G into 
tiles of width s each, so that a tile each ol E, F and G will fit in the main memory. 

We have D = 1{PQ + Qs) + {PQ + Qxr) + PR = ^ + PQ + QR + PR. S ^ 
(P + P(f + 1) + g) + 1) = Qr+^Pj+Pi + ^ + Q + 2P. S remains the same even if 
row seeks are not needed. Thus, T<^ + ^Q+Qf+^« + Qr+2Pr+Pg ^P<r^Q^2P 

' — sB B s ^ 

• P,R> y^M/3 and Q < y^M/3. Divide G into tiles of size sxs, E and F into tiles of 
height and width respectively of s each, so that a tile each of E, F and G will fit in the 
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main memory. We have D = ^^^2Qs+^^{xpQ+Qs) + ^{xrQ+Qs) + {xpQ+XrQ)+PR = 
PQl±pQR ^QR + pQ^pji, s = £^(2s + Q) + ^^{2xp + Q) + f(2s + Q) + 2xp + 
Q = ^ + ^Pr+Qr+pQ + 2P + Q. If rows follow each other contiguously, then S is 

pQr+pr I Pr+Qr+pQ+r+p i p i Q i ^ ThuS T < -^Q^+PQ^ _|_ PQj" _|_ "iPr+Qr+pQ . 2p i Q 

• P,Q,R> ^M/3. Divide E, F and G into tiles of size s X s, so that a tile each of E, 
F and G will fit in the main memory. We have D = ^^2Qs + ^{xpQ + Qs) + ^{xrQ + 
Qg) + (xpQ + XrQ) + PR= ^Q^'+pQ^ + gi^ + pg + pr, ^ = ^(i + i)((| + i)s + g) + 
(^ + l)((f + l)xp + g) + P(^ + l) = gOzyg + 2Pr+Qr+pQ+Pg ^2P + Q. ^ remains the 
same even if row seeks are not needed. Thus, T < pQ^+p^P- -\- pq+QR+pr _|_ pQr+Pqr _|_ 

2Pr+Qr+pQ+Pq _^ 2P + g 

When P = Q = R = N, and s divides A^, there are only two cases: 

• N < ^/mJI. T <^ + 3N. 

• N > v^M/S. We have D = ^ + iV2. 5 = ^ + ^ - 2N. Thus, T<^ + § + 

s 

In the above we assume that the matrices are in row major order. The case of column 
major order can be analysed similarly. The following Table [L2] summarises the asymptotic 
seek complexities for the various cases. 

In general, matrix multiplication takes q ^ _[_ pq+QR+RP _[_ g _[_ p^ seeks if the matrices 
are in CMO. 

1.5 Organisation of this Thesis 



In Chapter 2 we study QR decomposition. Chapter 3 is devoted to banded Hessenberg 
reduction and related problems. Chapter 4 deals with the reduction of banded Hessenberg 

16 



CHAPTER 1 



Introduction 



Table 1.2: The asymptotic seek complexity of multiplying a P x Q matrix with a Q x R 
matrix, when all matrices are kept in row major order (RMO) or all are kept in column 
major order (CMO) 





RMO 


CMO 


P,Q,R<s 


P + Q or 1 


Q + Rot1 


P,Q <s and R > 


s 


{P+Q)R 
s 


R 


P,R< s and Q > 


s 


Q 


Q 


Q,R< s and P > 


s 


p 


P{Q+R) 
s 


P,Q > s and R < 


s 


PQ 

s 


PQ 

s 


P,R> s and Q < 


s 


PR 

s 


PR 

s 


Q,R> s and P < 


s 


QR 
s 


QR 
s 


P.Q.R >s 


PQR 


PQR 



form to Hessenberg form. In Chapter 5, the generalised eigenvalue problem is studied. In 
Chapter 6, we study the QR and QZ algorithms. 
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2.1 Introduction 

In this chapter, we study the QR-decomposition problem. For A G M^^^, the problem 
seeks to find an orthogonal matrix Q G M^^^, and an upper triangular matrix R G R^^^ so 
that A = QR. QR-decomposition is a classical factorization method like LU and Cholesky 
decompositions. It is useful in solving the least squares problem, the Eigenvalue prob- 
lem and systems of linear equations. A QR decomposition can be obtained using Givens 
rotations, orthogonalisation via Gram-Schmidt and Modified Gram-Schmidt methods, or 
Householder transformations [58llll5j . Stability of the system, dimensions of the matrix, 
architecture of the machine and how the factorization is to be subsequently used are some 
of the factors that influence the choice [6^ . 

We focus on Householder based methods for QR-decomposition, because these exploit 
locality of reference better than rotation based algorithms [2T | [78 1 179]. and therefore are 
more amenable to OOC computations. 

The traditional QR decomposition based on Householder transformations is rich in 
matrix-vector, and not matrix-matrix operations. Efficient parallel and OOC computations 
require algorithms rich in the latter. Exploiting data locality to recast algorithms in terms 
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of matrix-matrix multiplication has received much attention [7t[23 |l39ll^ H3 t [96 |ll01j . For 
QR decomposition using Householder transformations, slab based [7l[23l|9ll|96] and tile 
based [28| IM] algorithms have been suggested. A slab based algorithm reads a slab of 
columns into the main memory at a time. A tile based algorithm partitions the matrix 
into a number of square tiles, and processes one tile at a time. 

We analyse some QR decomposition algorithms for their I/O complexity. We show 
that the traditional algorithm based on Householder transformations jll5] has an I/O 
complexity of 0(A^Pmin{iV, P}/B). 

The slab based algorithm of |96] is analysed. We choose this algorithm because of 
its simplicity. (There have been slab based algorithms that build on this. Recursive 
in-core algorithms that partition the matrix vertically are given in [IHIHQ]. An OOC 
implementation of these is given in [9lj.) We consider an implementation of the algorithm 
where matrix multiplication is done using the blocked algorithm of [58]. We find that the 
I/O cost of this implementation is of the same order as that of matrix multiplication for 
most choices of the problem parameters, and that the optimal choice for slab width is not 
necessarily the number of columns that can be fit in the memory. 

We also analyse the tile based algorithm of [64J, and show that its I/O complexity 
is of the same order as that of matrix multiplication. (A tile based algorithm that uses 
techniques similar to those of |6l] has been designed for multicore architectures [28] •) 

We also present a new cache-oblivious QR-decomposition algorithm, and its analysis. 
This algorithm is useful when Q needs to be reported explicitly. Its I/O complexity is of 
the same order as that of the tile based algorithm. 
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2.1.1 Organisation of this Chapter 

The rest of the chapter is organised as follows: Section 2 presents some prehminaries. Sec- 
tion 3 analyses the traditional QR-decomposition that uses Householder transformations. 
Sections 4 and 5 analyse a slab based and tile based algorithm respectively. In Section 6, 
we present a new cache-oblivious QR-decomposition algorithm and analyse it. 

2.2 Preliminaries 

If w is a unit vector in R^, then there exists a symmetric orthogonal matrix e E^^^, 
called the Householder of u, such that QuU — —u and, for any vector v orthogonal to 

QyV = V. 

Here u can be thought of as characterizing the {N — l)-dimensional hyperplane that 
passes through the origin and is orthogonal to u. Then, for any x, Q^x would be the 
reflection of x in this hyperplane. It is easy to see that Qu — {I ~ 2unF) satisfies the 
requirement. If \\u\\2 7^ 1, then let u = jnjr, the unit vector in the direction of u, and 



For any two vectors x and y with equal norms, there exists a u such that Q^x — y. This 
can be shown as follows: Let u = x — y and v = x + y. As {x — y){x + y) = | I2 — 1 I2 = 0) 
u and V are orthogonal. So, QuU — —u and QuV — v. But x — {x + y)/2 + {x — y)/2 — 
{u + v)/2. Therefore, QuX — Qu{v + u)/2 — (v — u)/2 — y. In particular, for a given 
X = [xi ■■■ XnY 1 let cr = sign(a;i). | |x| I2, and y = [—a ■■■ 0]"^. Clearly, x and 
y have equal norms, li u — {x — y) — [xi + a X2 ■■■ xjv]^, then QuX — y. Also, 
I \u\ I2 = {xi + gY + x\ + ... + x\ — I I2 + 2xxa + — 2a^ 2x^0 — 2aux. Therefore, 
Qu = {I - {l/(JUi)uvF). 

Thus, we have the following algorithmic result. 
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Algorithm 2.1. Given vectors u andv, QuV can he found in 0{N) operations and 0{N/B) 
I/Os. 

Proof. QuV = (f — j^^^uu^v). u^v can be found by making a pass each through v and u; 
\\u\\2 can be found on the fly; Let a = j^^^u^v. Now, v — au can be found by scanning u 
and v a second time. That totals to AN/B 1/Os, without counting the N/B 1/Os needed 
to write the result. Also, the number of seeks needed is 0{N/M). □ 

Algorithm 2.2. Given a vector x, a vector u such that QuX is along dimension 1 and has 
a sign opposite to that of xi can he found in 0{N) operations and 0{N/B) I/Os. 

Proof. maxj{a;j} can be found from a single scan of x. Dividing of every Xj with this, 
calculating of ||a;||2, o" = sign(a;i). ||a;||2, and u = [xi+a X2 ■ ■ ■ x^Y can all be done with a 
further scan. That totals to 2N / B 1/Os, without counting the N/B 1/Os needed to write 
u normalised to have a first component of 1. Also, the number of seeks needed is 0(1). □ 

2.3 The Traditional Algorithm for QR Decomposition 
using Householder Transformations, and its I/O 
Complexity 

If A G R^^^, then there exists an orthogonal matrix Q G R^^^ and an upper triangular 
matrix R G R^''-^ such that A = QR [115]. This can be proved by induction on N 
and P. If A = 1, then A is already upper triangular; A = [l]y4. If P = 1, then A 
is a vector. By the discussion above, there exists a Householder transformation Q such 
that Q^ A = R where vector R is upper triangular; all its components, except the first, 
are zero. Thus, A = QR. These form the basis. Suppose the statement is true for 
all matrices with fewer rows or columns than A. Choose k < P. li k > N, then let 
A = {Ai,A2), where Ai has k columns; 1 < A; < P. Then, by the hypothesis, there exists 
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an orthogonal matrix Qi e R-^^-^ and an upper triangular matrix Ri e R-^^'^ such that 

Ai = QiRi. Then say, R = Q^A = {QjAi QJA2) = {Ri ^2)- R is upper triangular. 

All A12 \ ^,1^^^ 4_ ^ rnkxk 



and therefore A = QiR. li k < N, let A ^ j , where An e R'''"'. Then, 

by the hypothesis, there exists an orthogonal matrix Qi e R^^^ and an upper triangular 
matrix R^= (^^^^ with Ri G R^^^' and Ri G R^^'^ such that ( ^^1 ) " ( ^0 ' 
Then say, B^QjA^ (^^^' ^'^^ ^. Let B22 = Q2R2. Then, with Q2 = ^ ) , 
Ql f \ _ / ^1 Bi2\-^ .g ^ppgj. triangular. That is Q^QjA = i?. Hence, 

V U ±>22 / \ it2 / 

A — QR, where Q — Q1Q2 is an orthogonal matrix. 

The traditional algorithm for QR-decomposition stems directly from this result. 

Algorithm 2.3. For any A G R^^-^, the QR-decomposition of A can be achieved in 
0{NPX) operations andO{NPX/B) I/Os andO{NPX/M) seeks, where X ^ mm{N, P}. 

Proof. If N — 1 there is nothing to do. If A?" > 1, Let A; = 1 in the proof of the above lemma. 
The Householder Qi can be found in 0{N) operations and 3N/B I/Os. We assume that u 
that characterises the Householder overwrites the first column of A except for An', since u 
is normalised, its first component need not be stored. The matrix B can be calculated by 
applying Qi to each column of A in a total of 0{NP) operations and 5N[P — 1)/B I/Os. 
Now the algorithm recurses on the (A^ — 1) x (P — 1) matrix ^22- Clearly, the algorithm 
hits the basis after X levels of recursion. 

T{N, P) < I EHo-' Ef=o (N - iB) {P-iB- j) = O(^) 

When N> P^ uj{B), the number of I/Os is ^{N - f + o{N)) 

When P> u(B), the number of I/Os = ^(P - f + o(P)) □ 
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2.4 The Slab Approach 



Here we analyse an implementation of the algorithm of [96], which is based on the following 
"Compact WY representation" : A product = Qi ■ ■ ■ Qk of k < N Householders, where 
each = (/ - /3iUiuf) G M^^^, can be written in the form (/ + YuTkY^) for a Yfc G M^''^ 
and an upper triangular G M^^'^. 

This can be proved by induction on k. For the basis, let = 1. Pi = Qi = {I—f3iUiuJ) = 
(/ + Ui[— f3i]uj) . Therefore, choosing Yi = Ui and Ti = [—Pi] will do. Say, we have Y^^i 
and Tk-i so that P^-i = Qi ■ ■ ■ Qk-i = {I + Yk-iTk-iYl[_-i) . Let Yk = {Yk-i Uk), and Tk = 



An implementation of the algorithm of |i96j is described now: Let A be the input N x P 
matrix. Divide it into vertical slabs each consisting of k columns, where < P is a 
parameter to be chosen. Let X = min{A^, P} and y = min{A^, k}. Let us call the leftmost 
slab Ai and the rest of the matrix A2. That is, A = {Ai A2). Perform a QR decomposition 
of Ai using the traditional algorithm. Also, on the fly, aggregate the Householders that 
the algorithm produces. While processing the z-th column of the slab, Tj can be computed 
from Yi_i, Tj_i and Qi in 0{iN) operations, 0{iN/B) I/Os and 0{iN/M) seeks; Y^ is 
obtained from Yi^i by merely adding Ui as an extra column. The cost of aggregating 
y Householders would therefore be 0{y'^N) operations, 0{y'^N/B) I/Os and 0{y'^N / M) 
seeks. Pre-multiply A2 with this aggregate, and if < fc, then halt. Otherwise, recurse 
with the lower N — k rows of A2. Assume that a matrix multiplication is done using the 
blocked algorithm of [58], when neither matrix fits in the main memory. 

The algorithm chooses a width k for the slab. The I/O complexity of the algorithm 
will depend on the choice. We analyse the I/O complexity of the algorithm for the various 
possible choices of k: 

In the tables below, a, /3, and 7 stand for the following operations in that order: QR- 




where p = -(3k and z = -j3kTk-iYj:_^Uk. Then, I + YkTkY^ 



Pk-lQk — Pk- 
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Table 2.1: Blocks of data moved in scenarios A, B, C and D. 



Scenario 


a 


P 


7 


V 


Total I/Os 


A 


Nk^/B 


Nk^/B 


kNP 1 NP 
BVIi ^ B 


X/k 




B 


N^k/B 


N^/B 


N^P 1 NP 
BVM ^ B 


1 




C 


Nk/B 


Nk/B 


NP 
B 


x/k 


NPX 
kB 


D 


Nk/B 


N^/B 


NP 
B 


1 


NP 
B 



Table 2.2: Seeks performed in scenarios A, S, C and D. 



Scenario 


a 




1 


V 


Total seeks 


A 


Nk^/M 


Nk^/M 


kNP 1 NP 1 p 
M ^ ^ ^ 


x/k 


NPX 1 NPX 1 PX 
M ' kVM ^ fe 


B 


N^k/M 


N^/M 


N^P , NP , p 
M VM^ 


1 


W^P 1 1 p 


C 


k 


k 


P 


X/k 


PX 


D 


k 


N 


P 


1 


P 



decomposing a slab, aggregating the Householders generated during the QR-decomposition 
of a slab, and multiplying the aggregate of a slab with the rest of the matrix, v stands 
for the number of slabs to be processed. Table 12.11 and 12.21 summarise, respectively, the 
number of blocks of data moved and the number of seeks performed by the operations 
under the scenarios A, B, C and D listed below. The total is calculated as v{a + /3 + 7) in 
each case. For brevity, we drop the order notation from the table. 

Scenario A: M < Nk (the slab does not fit in the main memory) and k < N (the slab 
is tall). Here k < N,P] so, y = k. 7 is dominated by the cost of calculating the product 
YTY^A, which is a multiplication chain of Nxy,yxy,yxN and N x P matrices. It is 
easy to see that the cost of the multiplication is as given, once we note that is available 
in row-major order. 

Scenario B: M < Nk (the slab does not fit in the main memory) and N < k < P (the 
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slab is fat). Here, X — y — N. Y and T are both N x N matrices. 

Scenario C: Nk < M (the slab fits in the main memory), and k < N (the slab is tall). 
Here k < N, P; so y — k. The QR decomposition and the aggregation of the Y matrix can 
be done in internal memory. YTY^A can be calculated one column at a time. 

Scenario D: Nk < M (the slab fits in the main memory), and N < k (the slab is fat); 
N < k < P; so X = y = N, and N'^ < M. Y and T are both N x N matrices. So is 
YTY^ , and thus it can stay in the main memory. Read A column by column, update and 
write back. 

Now we find the choices of k that, for the given values oi N, P and M, minimise the 
I/O complexity under various conditions. In all but two of the cases below, with "the tall 
cache assumption" , the number of blocks of data moved dominates the number of seeks 
performed. In addition to the optimal I/O bounds and the associated seek bounds, we also 
calculate the I/O and seek bounds for k — y/M. 

UN > M, there is no choice of k for which the slab will fit in the main memory. If, 
in addition, N > P, then N > k and X = P; the slabs are tall; we have Scenario A; the 
I/O cost is 0{^{k + j + -^))- Clearly, the choice that minimises this is A; = y/P, which 
costs 0(^(^ + ^)) I/Os, and 0(^(^ + ^)) seeks. 

If we were to choose k — -\/M, the number of I/Os would be 0{^{^^=-)), and the 
number of seeks 0{^^^). 

If iV > M, but N < P, a choice of k > N is also possible. If /c G [N, P], then we have 
Scenario B, and the I/O cost is 0{^^ + §{Nk + P)); k ^ N is the best choice. If k < N 
then we have Scenario A, and the cost is 0{^{k + ^ + '^))- The best choice of k from 
[1, N] is y/P or depending on whether a/P < N ot not. That is, the best k is \/P, 
i^^/P<N [I/O: 0(^(^ + ^)) = O(l^); Seek: 0(^)] and N otherwise [I/O: 
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0(f + F + A)) = O(^); Seek: + ^) = 

With A; = VM, the number of I/Os is 0{^{^)) = 0{^%), and the number of 



seeks is O(^). 



li N < M and < M/A^, then a e [A^, M/7V] can be chosen; Scenario D; blocks of 
data moved: 0(7VP/5); seeks: 0{P). In this case, seeks may dominate the I/O complexity. 



If < M and M/N < N, {M/N < VM < N < M), then the choice of k can satisfy 
(i) k < M/N, (ii) k e (M/N, N], or (iii) A; > A^. 

U P < N, then (iii) is ruled out. 

If M/N > P, the matrix fits in the main memory; Scenario C; 0{NP/B) blocks of data 
are moved; seeks: 0{P). Here too, seeks may dominate the I/O complexity. 

Now assume that M/N < P. That is, M/N < P < N < M. li k < M/N, we have 
Scenario C, and so the cost is 0{j^). The best choice for k in [1,M/A^] thus is M/N; 
I/O cost: O(^); seeks: O(^). For k > M/N, we have Scenario A; the I/O cost is 
0(^(-^ + k + j)), which minimises at A; = y/P, and grows monotonically thereafter. 

If ^ > M/N, then the minimum I/O in [M/N, P] is given by A; = ^/P and is 0{ ^^ ) — 
0{ ^^^ ). Thus k — VP is the best choice in Interestingly, here a slab size that 

does not fit in the main memory turns out to be the best choice. The number of seeks in 
this case is 0{^^). 

If y/P < M/N, then the smallest value in [M/N, P] is provided by A; = M/N, which 
gives a cost that is worse than 0{ ^^^^ ). 

A choice of A; = \/M makes sense only if P > \/M > M/N; even then the cost will be 

Let AT < M and M/N < N < P. First assume that P < M. That is, M/N < N < 
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P < M. lik < M/N (Scenario C), then the I/O cost is O(^). The best choice for k 



in [I, M/N] thus is M/N; cost: O(^). For k G [M/N,N] (Scenario A), the I/O cost is 
0{^{^^ + + j)), which minimises at k = y/P, and grows monotonically thereafter. 

If G [M/A^,A^], then the least I/O in [M/N,N] is given by = /P and is 
0(^(^ + ^)) = 0(^^); this is O(l^). Thus A; = is the best choice in 
[1,N]. Seeks: O(^). 

If < M/A^, then the least I/O for A; G [M/A^, A^] is provided by A; = M/N, which is 
worse than 0{^). Seeks: O(^). 

If > A^, then the best I/O bound is 0{^%); the choice A; = A^ gives this. Seeks: 



'BVM 

O(^). 

Now consider k G [A^,P]. Scenario B. The cost is 0{^^ + f (P + A^A;)). The best k 
is A^; cost: 0{^^= + ) . This does not improve on the minimum of [1, A^], in any of 

the cases. 

A choice of A; = causes the cost to be 0(|^). Seeks: O(^). 



It now remains to consider the case of M/N < VM < N < M < P. li k < M/N, then 
the cost is 0(§). The best choice for k in [1,M/N] thus is M/N; cost: 0(|^). For 
k = VM G [M/N,P], the I/O cost is O(^) = O(^). (Seek: 0(^).) A detailed 



analysis shows that the best choices for k are a/P and A^ when \fP G [M/N,N] and 
> A^ respectively; but the I/O and seek bounds for these choices are of the same order 
as those of A; = \fM asymptotically. 

The cost for a k from [A^, P] is 0{ + ^(P + A^A;)). The best choice for k is A^; cost: 
O ( + ^-^^^^ ) . This does not improve on the minimum of [1 , A^] . 



The above findings are summarised in Table [231 In particular, for a square matrix with 
N = P, the I/O and seek complexities are as given in Table [23J 
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Tabic 2.3: Summary of the analysis of the slab based algorithm 





k 


I/Os 


Seeks 


N>M, N>P 


VP 




NP^ / 1 _| ]_\ 


N>M, N<P, y/P<N 




N^P 

bVm 


N^P 
M 


N>M, N<P, Vp>N 


N 


N^P 


N^P 
M 


N<M, §>N 


M 
N 


NP:p 

B ^ ^ 


P 


N<M, §<N, P<N, f>P 


P 


NP 
B 


P 


N<M, ^<N, P<N, f <P, f <VP 


VP 


NP^ 

By/P 


NP^ 

■Jp-Jm 


N<M, §<N, P<N, §<P, §>y/P 


M 
N 


N^P^ 
EM 


M 


N<M, §<N, P>N, P<M, \/P<f 


M 
N 


N^P 
BM 


N^P 
M 


N<M, §<N, P>N, P<M, ^e[§,N] 


VP 


N^P 

bVp 


N^P 

VpVm 


N < M. ^- < N. P > X. P < M. xfP y- N 


N 


iivMi 


N-P 

M 


N<M, §<N, P>N, P>M, y^e[f,N] 


VP 


N^P 

bVm 


N^P 
M 


N<M, ^<N, P>N, P>M, y/P>N 


N 


N^P 

bVm 


N^P 
M 
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Table 2.4: I/O and Seek complexities of the slab based algorithm on a square matrix 





k 


I/Os 


Seeks 


N>M 






Ar3 
M 


N<y/M 


k e [N, M] 




N 


M2/3< N<M 




bVn 


Ar3 


^fM<N< M2/3 


M 
N 


BM 


Ar3 
M 



The I/O cost of the slab based QR-decomposition algorithm matches that of matrix 
multiplication in most of the cases. The best choice for slab width is not always the 
number of columns that fit in the main memory. Interestingly, a choice oi k = a/M seems 
to minimise the seek complexity at ^^/^ ■ However, it does not minimise the I/O cost. 

We conclude that existing slab based implementations may not have to be completely 
redesigned for good Out-of-core performance. If some of the elementary matrix operations 
in them (for example, matrix multiplication here) are handled I/O efficiently, optimal I/O 
performances might be achievable for most choices of the problem parameters. 

2.5 The Tile Approach 

Slab based algorithms like those of the last section are not scalable: the larger the row size 
of the matrix, the smaller the width of the slab that can be fit in the main memory. 
An alternative that has been suggested is the tile approach: partition the matrix into a 
number of square blocks called "tiles", and process the tiles one at a time. It has been 
observed that the tile approach results in more scalable out-of-core algorithms for Cholesky 
decomposition [63 t [93 t[T0T] and QR-decomposition [63] . Tile based QR decomposition for 
multicore architecture is discussed in [2H], using techniques similar to those of (M] . 
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Algorithm 2.4. QR decomposition using tile approach [64] 

Input: An N x P matrix A. 

Output: A = QR, where R is an N x P upper triangular matrix and Q is an N x N 
orthogonal matrix; Q is not output explicitly, but is expressed as a product of terms 
of the form I + YTY^ , where Y is an N x k matrix and T is a k x k upper triangular 
matrix. 



Choose t = 0(a/M); Choose k < t; 
Partition A into t x t tiles; 

Let Aij denote the tile that is in both the i-th tile row and j-th tile column; 
X = mm{N,P}; 
for i = 1 to X/t do 

QR-decompose{Aii) ; 

for j = i + ltoP/tdo 

QR- Multiply- from- left- l^Aa, A^j ) ; 

endfor 

for j = i + 1 to N/t do 
QR- Update{Aii, Aji); 
for / = i + 1 to P/t do 

QR-Multiply-from-left-2{Aji, An, Aji); 
endfor 
endfor 
endfor 



2.5.1 The Algorithm 

Here we analyse the QR-decomposition algorithm of [M] for its I/O complexity. A high 
level description of the algorithm is given in Algorithm 12.41 

Now, to analyse the algorithm, we assume that the input N x P matrix A has already 
been "tile-transposed" . (We define tile transposition of A as the problem of permuting its 
elements so that each tile is available contiguously in column-major order.) Note that if 
A is given in column- major order, then tile transposition can be done in 0{NP/B) 1/Os 
and 0{NP/t) seeks, if the tile size t > max{i?, VM}, because a tile can be processed in 
0{t^/B) 1/Os and 0{t) seeks. (The tile size is the number of columns (rows) in a tile.) 
Once we have found a ^/M x \/M tiling, for any t < \fM, a t x t tiling can be found in 
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0{NP/B) I/Os and 0{NP/M) seeks; read each a/M x \/M tile into the main memory, 
form a. t X t tihng and write back. Putting the above two statements together, we have 
that if a/M > B (the tall cache assumption), for any t, a t x t tiling of A can be found in 
0{NP/B) I/Os and 0{NP/{t + y/M)) seeks. 

We analyse for the following implementations, where A; < t is a parameter of the algo- 
rithm: 

1. The single tile implementation, in which a single tile and three and a half narrow 
panels of size t x k occupy the main memory; i.e. + lkt/2 < M. 

2. The two tile implementation, in which two tiles together with two and a half narrow 
panels of size t x k occupy the main memory; i.e. 2t^ + 5kt/2 < M. 

3. The three tile implementation, in which three tiles together with with a panel and a 
half of size t x k occupy the main memory; i.e. St"^ + 3kt/2 < M. The half panel will 
hold the T matrices and the full panel will provide space for computations. 

The parameter k can be chosen so as to minimise the number of I/Os. We now describe 
and analyse each procedure of Algorithm 12.41 

Function QR-decompose(74jj) 

Assume that the diagonal tile An is divided into vertical slabs of width k each. Let 
r = t/k. The invocation QR-decompose(y4jj) finds matrices Yi, . . . ,Yr, Ti, . . . , such that 
Rii = {It + YtTJyJ^) . . .{It + YiT-[Y^)Aii is upper triangular. This is done as follows: 

Read An into the main memory. Proceed in r iterations. At the beginning of the s-th 
iteration, the subdiagonal elements of the first {s — l)k columns of An are all zero. In the 
s-th iteration, calculate K, and Tg such that the subdiagonal elements of {It + KjTjy/')Ajj 
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in the first sk columns are all zero. Update An to {If + YsTj'Yj^)Aii. Store Yg in the 
subdiagonal part of the s-th slab of An. 

When all slabs are processed, arrange the upper triangular part of An in row major 
order, and write A^ back to the disk. 

The number of I/Os needed is 2t^/B; A^ is read and written; the T matrices are kept 
in the main memory. 

The number of seeks is even for the one tile case. 
Function QR-Multiply-from-left-l(^ji, Aij) 

If An QR-decomposes into QnRa, then a pre- multiplication of A by an orthogonal matrix 
obtained by replacing the i-th diagonal tile of In by Qn has the effect of making An upper 
triangular. This pre- multiplication also updates all tiles Aij with j > i. 

The invocation QR-Multiply-from-left-l(>lii, ^Ijj) achieves this update of Aij as follows: 

Read A^j into the main memory. Use the Y matrices in the lower triangular part 
of An and the T matrices that remain in the main memory after the invocation "QR- 
decomposc(Ajj)" to update A^j. When all updates are over, write A^j back in row major 
order. How the Y matrices are accessed will depend on how many tiles fit in the main 
memory. If only one tile fits, then the Y matrices are read in one after the other, and so 
the total number of I/Os is at most 2.bt^ / B. If two tiles fit, we can assume that An resides 
in the main memory, and so the total number of I/Os will be at most 2t^/B. 

The number of seeks is 0(1) even for the one tile case. 
Function QR-Update(Ajj, Aji) 

When An is upper triangular, the invocation QR-Update(Ajj, Ajj), finds a matrix Q such 
that in Q^A, every element of the {j, i)-th tile is zero, and (i, i)-th tile is upper triangular. 
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This is done over r — t/k iterations. The s-th iteration processes A\^' , the s-th topmost 



horizontal slab of width k of A,, . Consider the matrix C, 



Aji 



and T, 



be matrices such that ^h+t + ^ y ^ ( -^fe Yj^ )^ zeroes in the lowest t rows 

Ai. 



of its s-th slab. Then, 



l2t + 



{s-l)k \ 

Ik 
Ot-sk 



{s-l)k 



Ik 0. 



t—sk 



has 



A,, 



zeroes in the lower triangular part of its s-th slab. Set |^ ^" J to this matrix by updating 

A^^^ and Aji. The s-th slab of Aji is now all zero. Therefore Yg can be stored there. 

Let be the product 









h 







The above can be implemented as follows: Read Aji. Read An one horizontal slab at a 
time. For each slab, update the slab, write it back and overwrite the corresponding vertical 
slab of Aji with the Y matrix. The number of I/Os is ?>t^ /B, irrespective of whether one 
tile or two tiles fit in the main memory: one tile and a half are both read and written. 

The number of seeks is 0{t/k) in the one tile case, and in the two and three tile 
cases. 

If both An and Aij are upper triangular (We will encountered such an invocation in 
Chapter 3.), then the number of I/Os is 2t^ /B: two half tiles are both read and written. 
The number of seeks would be 0(1) even in the one tile case. 



QR-Multiply- from- left-2 (Aji, A-a, Aji) 



Here i < j and i < I. Suppose Qji 
is upper triangular. Then, 



. is a QR decomposition of , . 



^ Aji Aji 



Rii Bii 

Bji 



Au 

■ji 



where Ai. 
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where f I ~ QJi \ V ^^J"' QJi ^^'^ have been calculated by an invocation to 

V ^ji J \ ^ji J 

QR-Update(Ajj, Ajj). At the end of the invocation the Y matrices are stored in Ajj,. 

Bii and Bji can be calculated as follows: 

Read Aji. Read a horizontal slab from A^ and the corresponding vertical slab from Ajj,. 
Update the horizontal slab and Aji using the vertical slab, and write the horizontal slab 
back. When all slabs have been processed, write Aji back. If only one tile fits in the main 
memory, then the number of I/Os is 5t^/B: Aji and A^ is read and written; Aji is read. 

If Aji is upper triangular (We will encounter such an invocation in Chapter 3.) then 
the number of I/Os would be A.bt^ /B. 

If two tiles fit in the main memory, then Aji can be kept in the main memory, as / 
varies, and so the number of I/Os is At^ / B. 

The number of seeks is 0{t/k) if one or two tiles fit in the main memory; 0(1) if three 
tiles fit in the main memory. 

I/O complexity 

The I/O complexity for the one tile implementation is 

2 X/t ( Pit N/t ( Pit 

i=l y j=i+l i=i+l \ i=i+l 

^ -{^NPX - ZNX'^ - ZPX"^ + 2X^) + —{2NX - X^) + -^X 



6tB^ ' AB^ ' 12B 

where X — mm{N,P}. In particular, when N — P — X, the I/O complexity is |^ + 

^ + ^ where t ^ VM. 

The I/O complexity for the two (also, three) tile implementation is 

2 Xlt ( Pit Nit 

35 







34 


£.)) 


V 





CHAPTER 2 



QR Decomposition 



= -^{6NPX - 3NX^ - 3PX^ + 2X^) - ^{X^ - 2NX) + -^X 

In particular, when N ^ P ^ X, the I/O complexity is ^ + ^ + where t ^ y/M/2 
and t ^/^^/3 for two and three tile implementation respectively. 

We also require a number of I/Os to write the T matrices. We have t/k number of T 
matrices of size k x k. Since the T matrices are upper triangular, that amounts to roughly 
k/2 elements. There are X/t number of single tile QR decompositions and N — i number 
of two tile QR updates for i = 1, . . . , X/t — 1. So I/Os required to handle the T matrices 

Efi; # + eS'-' ^ = 5fe {NX - ¥ + f ) 

Clearly, from the above, the one tile implementation is the best of the three. 
Seek complexity 

If three tiles fit in the main memory, then the number of seeks is 0{ ^^^ ). If at most two 
tiles fit, then it is O(^). 

2.5.2 An 0(1) factor improvement for the one tile implementa- 
tion 

Wc suggest a small modification to the above algorithm that can improve the number of 
I/Os by a constant factor for the one tile case. 

After the invocation to "QR-decompose(Ajj)", keep An in the main memory, while 
"QR-Multiply-from-left-l(Ajj, ^y)" is invoked for each j > i. Read a vertical slab of A^j , 
update it and write it back in each iteration. Thus the invocation "QR-Multiply-from-left- 
l{Aii, Aij)" would take only ^ I/Os. The updated A^ is now written in column major 
order. This version of the procedure, however, will incur 0{t/k) seeks in the one tile case. 

This would necessitate a change in Procedure QR-Multiply-from-left-2, because now 
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the invocation "QR-Multiply-from-left-2(Ajj, An, Aji)" , would find that QR-Multiply-from- 
left-l{Aii, Ail) has left An in column major order. 

Suppose in QR-Multiply-from-left-2(Ajj, An, Aji), we keep Aji in the main memory, and 
bring a column each of An and Aji into the main memory in each iteration. All the Y 
matrices needed to update these columns are in the main memory now, and so are the 
columns. The updates of these columns are independent of those of the other columns. 
We can perform these updates and write the columns back. This reduces the number of 
I/Os performed by the invocation "QR-Multiply-from-left-2(y4jj, Aj/, Aj/)" to even for 
a one tile implementation. 

The I/O complexity, therefore, is 

2 X/t / P/t N/T / P/T \\ 

i=l \ j=i+l j=i+l \ l=i+l I I 

= -^(GNPX - 3NX'^ - SPX^ + 2X^) + —(2NX - X^) + —X 
3tB^ ' 2B^ ^ 6B 

In particular, when N = P = X, the I/O complexity is |^ + |§ + |^ where t ~ v^M. 
The number of seeks is O ( ^j^^ ) . 

2.5.3 Comparison of seek times for slab and tile approaches 

We compare the seek times of the slab based and tile based algorithms in Table 12.51 Order 
notation is omitted for brevity. Clearly, the tile approach is superior. 

2.6 A Cache Oblivious Algorithm for QR Decomposi- 
tion 

In this section, we present a cache oblivious algorithm for QR-decomposition with an I/O 
complexity of 0{ J^^ asymptotically the same as that of the algorithm discussed 

in Section [2751 
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Table 2.5: Comparison of seeks needed by the slab and tile base algorithms 





N < ^/M 


VaI <N < 


<N < M 


N > M 


slab 


N 


M 


Vnm 


M 


3 tiles 








JV^ 


M 


mVm 


M\/M 


M\/M 


1 or 2 tiles 


M 


Mk 


n'^ 

Mk 


Mk 



A cache oblivious QR-decomposition algorithm based on Givens rotations is proposed 
in [53j. That algorithm consists of two mutually recursive functions, / and e, where 
the function / QR-decomposes a block and the function e annihilates a block using an 
upper-triangular block as the pivoting block, and assumes that the input matrix is given in 
Morton order (also called bit-interleaved order or Z-order). Our algorithm uses Householder 
transformations, and is simpler in that it does not use mutual recursion. Also, we do not 
assume any input representation. However, if the input is in Morton order, our algorithm 
will run efficiently even without the "tall cache assumption" j5l] . 

Our algorithm uses the divide-and-conquer technique. First we present the algorithm 
for square matrices, and generalise it later. The algorithm consists of two functions "CO- 
QR-decompose" and "CO-QR-Update" . "CO-QR-decompose" QR-decomposes the input 
matrix using Householder transformations and returns the orthogonal transformation Q 
and upper triangular R. "CO-QR-Update" takes as input two blocks A and B of size 
N xN each, where A is upper triangular, and returns an orthogonal matrix Q G M^^^ and 
upper triangular matrix R G M^/^^^/^ so that ~^(^^^' "^^^ functions "CO- 

QR-decompose" and "CO-QR-Update" are described in Algorithm 12.51 and Algorithm 12.61 
respectively. 
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Algorithm 2.5. CO-QR-decompose(A) 
Input: An N x N matrix A. 

Output: An orthogonal matrix Q G M^^^ such that A = QR, where R G M^^^ is upper 
triangular and overwrites A. 



Step-1 if = 1 then return [1]; 

Step-2 Partition A into four tiles of equal size: 

Step-3 Qi = CO-QR-decompose^Aoo) ; 

Qi 



Step- 4 Let Qi — , „ J 

U lN/2 

Step-5 Q2 = CO-QR-Update{Aoo, Aw) 
Step-6 Let Q3 = Q1Q2; 

Aqi \ _ I ^01 
An ~^'[ An 



Step-7 ( ) = Qs 

^L 

Step-9 Return Q 



Step-8 Qi = CO-QR-decompose{Aii] 

In/2 



Q 



4 



Aqo Aqi 

Alo An 



2.6.1 I/O Analysis 

Let S{N) denote the worst case I/O complexity of an invocation "CO-QR-Update(A, B)" , 
where A and B are N/2 x N/2 matrices. The invocation spawns four recursive invocations 
each on a pair of N/ 4 x N/ 4 matrices, and 28 multiplications of two N/ 4 x N/ 4 matrices. 
(See steps 7, 8 and 13 of Algorithm EH) If A < Vm, then ^(A) = Q{N^/B). For 
A > v^M, we have the following recurrence relation: 

S{N) = AS{N/2) + 28V{N/A) (2.1) 

where V"(A) denotes the 1/0 complexity of multiplying two A x A matrices. When 
A G {VM, VM/2], S{N) = e(AV5) = e(-^); that is, there is a constant d such 



that S{N) < -f^. Moreover, there is a constant a such that ^(A) < [54]. For 



A > v^M, if (i > 7a/8, we can use induction to claim that S'(A) < -^yp, because then 



s(iv, < 'Ami + ^ 

bVm b^ eVm 
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Algorithm 2.6. CO-QR-Update(A,'Bj' 



Input: A,B£ M^/2x^/2 matrices, A is upper triangular. 



Output: 



A 
B 



= Q 



R 




; where R e ^^/'^^^/'^ is upper triangular, and Q e M.^^^ is orthogonal 



matrix. 



Step-1 



Step-4 
Step-5 

Step-6 
Step-1 
Step-8 
Step-9 



if N = 2, then compute the Householder transformation Q such that Q^A 
is of the form ^ | 

Return Q; 



Let A 



Step-2 Partition the matrices A and B into 



A 
B 



( ^00 





-82 
\ S30 



All 



-82 1 

S31 ) 



where ^00 and An are upper triangular and e ]gJV/4xJV/4^. 



Step-3 Q 



1 = CO-QR-Update{Aoo,B2o) 

Qoo Qoi 
Qio Qu 

( Qrin Qoi 



Partition Qi into 



Let Qi 



'Af/4 







Q{i 

Jjv/4 J 

Q2 = CO-QR-Update{Aoo,B3o); 



where Ql^ € R^/AxN/i. 







\ 



10 




Partition Q2 into 

( QI 



Qoo Qoi 
Qio 



Let Q2 







l-N/i 






'JV/4 



^00 




Let Q3 = Qi ■ Q2 

f Aoi\ ( Aoi \ 

All r> ^11 

R = ^3 • n 

-D2I -D2I 

V B31 y \ B31 / 

Qi = CO-QR-Update{Aii, B21) 



where Ql^ e 
\ 



xJV/4. 



=^01 






Qii J 



Set 



Partition Q4 into 

( In/a 




Let Qi 



\ 



?oi 



Step-10 

Step-11 Qg = CO-QR-Update{An, Bsi); 



Qfm Qoi 

Qio Qii 


*5oo Qoi 
Qti 




where Qj^ G 



xJV/4. 











In/A j 



Partition Q^ into 
I In/2 
Let (55= ^ 



Qio 



Step-12 

step-13 Return Q = Q3 ■ Q4 ■ Q5; 



V 



01 

*3io Qii 



N/2 



Qf 



00 




where Ql^ e 

Qfn 




xJV/4. 



10 



gfi y 
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Similarly, let T{N) denote the worst case I/O complexity of an invocation "CO-QR- 
decompose(A)", where A is a N x N matrix. If < a/M, then T{N) = Q{N'^/B). For 
> v^M, we have the following recurrence relation: 

T{N) = 2T{N/2) + S{N) + 8V{N/2) (2.3) 

See Algorithm [231 When G {VM, T{N) = Q^N"^ / B) = e(^); thus, there 



BVM' 

N3 



is a constant constant c such that T{N) < -ff=. For > VM, if c > A{d + a)/3 > 5a/2 



we can use induction to claim that T{N) < -^yjj, because then 



BVM 



2cA^/2 3 dN^ MiN2f cN^ , 

T(N) < ^ + = + ^ < = 2.4 

~ By/M By/M By/M ~ B^M 



2.6.2 The General Case 

Now we extend the algorithm for general matrices. Assume that the input is an A^ x P 
matrix A. 

U N < P, then divide A vertically into two slabs Ai and A2, A = {Ai A2), so that Ai has 
A^ columns. Perform a QR decomposition of Ai, Ai = QR, using "CO-QR-decompose". 
Then compute the product Q^A2. The I/O complexity of this procedure is 

T'{N, P) = T{N) + V'{N, N, P-N) 

where V'{P,Q,R) is the I/O complexity of multiplying a P x Q matrix and a Q x R 
matrix. If A^ < VM, then the I/O complexity is T'{N, P) = 0{NP/B). li N > Vm 
then, nN, P) = O (5^) + O = O {0,) 

( ^' \ 

If N > P, then partition A into P x P blocks from the top; A = '■ where each 

\ An/p J 

Ai is a PxP matrix. Perform a QR-decomposition of Ai by invoking "CO-QR-decompose". 
Then, for 1 < i < N/P, perform a QR-update of Ai and Ai using "CO-QR- Update". The 
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I/O complexity of this procedure is 

T'(N, P) = T{P) + {N/P - l)S{P) 

If P < ^/M, then both T{P) and S{P) are 0{P'^/B). Therefore, T'{N, P) ^ O (^). If 
P > VM, then both T(P) and 5(P) are 0{r^); T'{N, P) = O (;^). 
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Banded Hessenberg Reductions and 
Related Problems 

3.1 Introduction 

Computing of eigenvalues and singular values of matrices have wide-ranging applications in 
engineering and computational sciences such as control theory, vibration analysis, electric 
circuits, signal processing, pattern recognition, numerical weather prediction and informa- 
tion technology, to name a few [6||33 | [65 |I120] . 

The eigenvalues of a square matrix are typically computed in a two stage process. In 
the first stage the matrix is reduced, through a sequence of orthogonal similarity transfor- 
mations (OSTs), to a condensed form (Hessenberg form, if the matrix is nonsymmetric, 
and tridiagonal form, otherwise), and in the second stage an iterative method called the 
QR algorithm is applied to the condensed form [5811115] . The reduction of a nonsymmetric 
N X N matrix to upper Hessenberg form as well as a symmetric N x N matrix to tridiag- 
onal form using OSTs takes 0{N^) flops. Each iteration of the QR algorithm takes 0{N'^) 
and 0{N) flops respectively, when applied on Hessenberg and tridiagonal matrices. 

The singular values of a matrix are also typically computed in a similar two stage process. 
First the matrix is reduced to bidiagonal form using orthogonal equivalence transformations 
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(OETs); this takes 0{N^) flops. Then the modified QR algorithm [53| is applied to the 
bidiagonal matrix. Each iteration of the QR algorithm takes 0{N) fiops. 

(An OST on a square matrix A G M^^^ transforms A into Q^AQ using an orthogonal 
matrix Q. An OET on a matrix A G M^^^ transforms A into Q^AP using orthogonal 
matrices Q and P. OSTs and OETs preserve eigenvalues.) 

Thus, in each case, the first stage of the computation, which costs 0{N^) fiops, forms 
a bottleneck. Each of these algorithms also needs 0{N^/B) I/Os to perform its 0{N^) 
operations, as we shall see in Chapter 4. 

The traditional Hessenberg, tridiagonal, and bidiagonal reduction algorithms based on 
Householders or rotators are rich in V-V and V-M operations, and not in M-M opera- 
tions [5811115] . Householder based sequential and parallel algorithms for Hessenberg, tridi- 
agonal and bidiagonal reductions using the slab approach have been proposed [15 | B6 | [89]. 
But even in these, V-M operations dominate the performance: after the reduction of each 
column of a slab, the rest of the matrix needs to be read in to update the next column before 
it will be ready to be reduced. This is because of the two sidedness of the transformations 
involved. This also makes it difficult to design tile based versions of these algorithms. To 
the best of our knowledge, no tile based algorithm has been proposed for the above direct 
reductions. 

Due to the above reasons, it has been proposed that the reduction in the first stage be 
split into two steps [IZlliniElllTTllHn], the first reducing the matrix to a block condensed 
form (banded Hessenberg, symmetric banded or banded, as is relevant) [TT1[6T1[8T], and 
the second further reducing it to Hessenberg, tridiagonal or bidiagonal form [2T | [77 t [78 | [80] . 
Almost all operations of the first step are performed using M-M operations. Though 
the modified first stage takes more fiops, its richness in M-M operations, makes it more 
efficient on machines with multiple levels of memory. Usually these reductions are imple- 
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mented using relatively expensive two sided orthogonal transformations rather than the 
inexpensive single sided Gauss elimination, because orthogonal transformations guarantee 
stability [T02l[TT5] . 

We now present an overview of the results in this chapter. 

3.1.1 Reduction of a nonsymmetric matrix to banded Hessen- 
berg form 

Suppose the input is a nonsymmetric matrix A G M^^^, and it is to be reduced to banded 
upper Hessenberg form Ht of bandwidth t. This could be done using an OST: construct an 
orthogonal matrix Q G M.^^^ such that Ht = Q^AQ [T7ll58ll81ll83] . A slab based sequential 
algorithm |58] , and a parallel algorithm for message passing multicomputers [17] are known. 
Tile based algorithms for multicore architectures using Householders and Givens rotations 
are presented in [HTIISS] . 

We study the slab based algorithm of [58], and analyse it for its I/O complexity for 
varying values of M and A^, for a given value of t. This algorithm assumes that the slab 
size k = t. We generalise this into an algorithm with k not necessarily the same as t. We 
find that the algorithm performs the best when k = min{t, a/M}. 

We also investigate the I/O complexity of some tile based algorithms [27 1 [M f 190]. 

These algorithms, typically choose t as the tile size, and also at any time, maintain only a 
small number of tiles in the main memory. If t ^ y/M, then most of the main memory stays 
unused. We propose an algorithm that allows the number of tiles kept in the main memory 
to depend on the tile size. We also propose an algorithm for t > a/M, which partitions the 
matrix into tiles of size v/M/3; this gives a constant factor improvement over the choice 
of t as the tile size. We conclude that banded Hessenberg reduction can be achieved in 
0(A^(A^ - t)V(5min{t, a/M})) I/Os using the tile approach, when (N - t) = Q{N). 
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3.1.2 Reduction of a symmetric matrix to symmetric banded 
form 

Suppose the input symmetric matrix A G M^^^, reduces to symmetric banded form using 
a two stage approach called successive band reduction (SBR) [201 - I22] . Here we consider 
the first stage of SBR. 

Suppose the input is a symmetric matrix A G M^^^, and it is to be reduced to symmetric 
banded form Tt of bandwidth 2t + 1. This could be done using an OST: construct an 
orthogonal matrix Q G R^""^ such that Tt = Q^AQ [HIEHIIHIIIHS]. In symmetric banded 
reduction, due to symmetry only the lower or upper banded portion needs to be stored. 

The problem is well studied in parallel and out-of-core contexts p ^ [TT |[55l[60llll9j . All 
these algorithms are slab based. The performance of these algorithms depends upon various 
factors like data layout, the final bandwidth, and the slab width. Usually the slab width 
k is assumed to be same as the desired bandwidth t pT ] IT9 | [55 | l60]. 

We look at the out-of-core algorithm of [60] in particular. Parallel version of this algo- 
rithm appears in [HIIIH] . A parallel algorithm using PLAPACK was proposed in |119j and 
also implemented in [lOlllI]; these assume that k is not necessarily equal to t. 

The block-tridiagonal divide-and-conquer (BD&C) algorithm of [56] computes eigen- 
values of banded matrices directly instead of further reducing it to tridiagonal form. 
See [inillllES] for further details on BD&C These works also discuss the symmetric band 
reduction. In particular, in [1^ it is mentioned that k should always be less than or equal t 
for optimum performance. The reduction of a banded form to tridiagonal form is discussed 
in [2IIE51I721I771I951I97]. 

In this chapter, we recall the standard the slab based algorithm for the problem. Our 
tile based algorithms for banded Hessenberg reduction work for symmetric band reduction 
too, and perform about half as many I/Os. 
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3.1.3 Reduction to banded form 

Suppose the input matrix is A G R^^^, and it is to be reduced to banded form Bt of 
bandwidth t. This could be done using an OET: construct orthogonal matrices U G M^^^ 
and V G R^""^ such that Bt = U^AV [6ll[82l[83]. If > P then Bt would be in upper 
banded form, otherwise in lower banded form. 

A slab based parallel algorithm in the context of distributed memory for the two stage 
reduction is described in [61] . An algorithm for the first stage on a multicore architecture 
is given in [S21E3]- The second stage is discussed in [TSllSn]- See Chapter 4 for details. 

We describe slab and tile based algorithms for the problem and analyse them for their 
I/O complexities, as in the cases of banded Hessenberg and symmetric band reductions. 

3.1.4 Organisation of this Chapter 

The rest of the chapter is organised as follows: Section 2 discusses the reduction of a 
nonsymmetric matrix to banded Hessenberg form using the slab approach. In Section 3, 
we recall and propose tile approach algorithms for banded Hessenberg reduction. Reduction 
of symmetric matrix to banded form using the slab approach is discussed in Section 4. In 
Section 5, we propose tile based reduction algorithms for symmetric band reduction. The 
reduction of a general matrix to upper banded form using the slab approach is given in 
Section 6. In Section 7, we describe some tile based algorithms for the reduction of a 
general matrix to upper banded form. 

3.2 Reduction of a Nonsymmetric Matrix to Banded 
Hessenberg Form using the Slab Approach 

Let A G M^^^ be the nonsymmetric matrix that is to be reduced to banded upper Hes- 
senberg form Ht of bandwidth t. 
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3.2.1 Reduction with a slab width of t 



The algorithm of [58], when N — t > t, partitions the matrix into slabs of width t, and then 
proceeds in N/ 1 — 1 iterations, each of which reduces a slab using block QR factorisations 
and then updates the rest of the matrix from both sides using aggregated Householders 
The algorithm is illustrated below: 

Consider the first slab of A: 





t 


N-t 






r Au 


A12 ' 


t 


A = 










A21 


A22 


N-t 



Perform a QR factorisation on ^421 such that A21 = Q2iR2i- Then an OST with Q = 
It \ . 

g^J^"'" 

/ Au I A12Q21 \ 
Q'AQ = 

\ -R2I Q\\A22Q2\ j 

If Q is in WY-representation, this involves computing a product of the form il^WY'^^'^ Ail^ 
WY^). Now the first slab is in banded Hessenberg form. Repeat this N/ 1 — 1 times, and 
the matrix reduces to banded upper Hessenberg form Ht: 



( Hn H12 
H21 H22 



H' 



32 



Hr.N 



\ ■■■ H Ki^K-i) Hnn j 
where each is a t x t tile; is a full tile when i < j, upper triangular when i = j + 1, 
and all zero otherwise. Thus, Ht is banded upper Hessenberg with bandwidth t. 

The I/O complexity of this algorithm can obtained from that of Case-1 in Algorithm [SU], 
by substituting k = t. 
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If — t < t then A is partitioned as follows: 





N-t 


t 






r An 


A12 ' 


t 


A = 










_ A21 


A22 . 


N-t 



Perform a QR factorisation on A21 such that A21 = Q2iR2i- Then an OST with Q = 

( ^* ^ ] gives AQ which is in the desired banded Hessenberg form. The I/O com- 
\ U Q21 J 

plexity of this is as given in Table [STTl (Refer to the table in Chapter 2). 

Table 3.1: The case of (A^ ~ t) < t. The number of I/Os for QR decomposition, matrix 
multiplication, and the total are given in columns titled QRD, MM and Total respectively 



Conditions on M & - t 


QRD 


MM 


Total 


N -t>M 


{N-tf 

Vmb 


N{N-tf 
y/MB 


N{N-tf 

Vmb 


N -t<^I 


(N-tf 
B 


N{N-t) 
B 


N{N-t) 
B 


M^/'^< N -t<M 


{N-tf 
-/N^B 


N{N-t)^ 
y/MB 


{N-ty^ . N{N-ty^ 

y^N^B VMS 


y/M<N-t< 


(N-t)* 
MB 


N{N-tf 
y/MB 


(Ar_f)4 N{N-t)^ 
MB ' VMB 



3.2.2 Reduction with a slab width not necessarily equal to t 

In the above algorithm, if the slabs to be QR decomposed are very small compared to the 
main memory ((A^ — it) x t M), then the main memory would seem under utilised; 
a larger slab width might be appropriate. Also, if the slab is too big to fit in the main 
memory ((A^ — it) x t ^ M then the QR factorisations and subsequent updates are to 
performed out-of-core; a smaller slab width might do better. We present an out-of-core 
algorithm (see Algorithm 13. ip that uses a slab width of k not necessarily the same as t. 

Algorithms of a similar vein have been proposed for the reduction of a full symmetric 
matrix to symmetric banded form [TOlfTTllTTQ] , and it has been observed |10j that choosing 
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Algorithm 3.1. Banded Hessenberg reduction using the slab approach 

Input: An N x N matrix A, bandwidth t and algorithmic block size k. 

Output: H = Q^AQ, H is N x N banded upper Hessenberg with bandwidth t, a number 
of Householder aggregates with Q as their product. 

Case-l: k <t. For ease of exposition, assume that k divides t 
for i = 1 to {N -t)/t do 
Let g = {i — l)t; 
for j — 1 to t/k do 
Let h^{j - l)k; 

QR-Decompose{A[g + h + t + l: N, g + h + 1: g + h + k]); 

Let Qij = (/ + YijTijY^^) be the resulting compact WY representation of Q; 

Update the rest of the matrix: 

Left-multiply A[g + h + t + 1 : N, g + h + k + l: N] with Qij; 
Right-multiply A[l : N; g -\- h -\- 1 -\- 1 : N] with Qij; 
endfor; 
endfor; 

Case-2: k > t. For ease of exposition, assume that t divides k; 
for i = 1 to {N -t)/k do 
Let g — {i — l)k; 

Let y denote the range g + t + 1 : N; 
Let z denote the range g -\- k : N; 

My^ y] = ^[y^ y]i 

for J = 1 to k/t do 
Let h^{j - l)t; 

QR-Decompose{A[g + h + t + l : N, g + h + 1: g + h + t]); 

Let Qij = (/ + YijTi.jY^) be the resulting compact WY representation of Q; 

Let Yi = Yij if j = 1 and Yi = {Yi Yij) otherwise; 

( T- T Y^Y T - \ 
Let Ti = Tij if 3^1 and Ti^i ^ ±^l^^^J±^J \ ^^/^g^^^g. 

Let X denote the range g + h + t + 1 : g + h + 2t; 
If (j ^ k/t) do 

Update the next t columns of the panel i: 

Compute Bij = A[y, x] + AYiTi [Y^[l : h + t, x]) ; 
Compute A[y, x] = {I + YiTiY^)Bij; 
end if; 
end for; 

Update the rest of the matrix: 

Right-multiply A[l : g + t, y] with (/ + YjTiY^); 
Compute A[y : z] = A[y : z] + AYiTi : k, z]); 

Left-multiply A[y : z] with (/ + YiTiY^^); 
endfor; 
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a value of at most t for k will do better than otherwise. To the best of our knowledge, 
no such out-of-core algorithm has been proposed for banded Hessenberg reduction. Case-2 
of Algorithm 13.11 has been inspired by the parallel algorithm of |119j . while Case-1 is a 
generalisation of the algorithm described in Subsection 13.2.11 |58j . 

Algorithm 13.11 divides A into vertical slabs of k columns each. Given the values of M, 
N and t, the I/O complexity of the algorithm depends upon the choice of k. We analyse 
the two cases of t < k and t > k separately, with an intent of finding the choice of k that 
would minimise the number of I/Os. 

3.2.2.1 Case 1: k<t 

For 1 < i < {N — t)/t and 1 < j < t/k, in the (i, j)-th iteration there are a QR decom- 
position of an (A^ — g — h — t)xk matrix, a matrix multiplication chain of dimensions 
( A^ — g — h — t,k,k,N — g — h — t,N — g — h — k), and a matrix multiplication chain of dimen- 
sions (A^, N —g — h—t, k, k, N —g — h—t), where g = {i — l)t and h = {j — l)k. Let a, (3 and 7 
denote the I/O complexities of these operations, in that order. Clearly, 7 > /3, in all cases. 
As i varies from 1 to (A^ ~t)/t and j varies from 1 to t/k, {N — g ~ h — t) takes on values 
Ik for (A^ ~ t)/k > I > 1. The slab fits in the memory when (A^ — g — h — t)k = Ik"^ < M; 
that is, / < M/fc2. 

The following table gives the asymptotic I/O complexity of the l-th iteration, for / > 1. 
We omit the 0-notation for brevity. (Refer to the table in Chapter 2.) 

Let c = a/M /k. 

If c > 1, then the table simplifies as follows: 

Summing over the iterations, the I/O complexity is as given in Table [331 For example. 
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Table 3.2: a, 7 and their sum in the l-th iteration for various values of / 





a 


7 


a + 7 




k 


Nik 
B 


Nik 
B 




ik^ 

B 


Nik 

B 


Nik 
B 


<l< c^y/k 


l^k* 
BM 


(1 + ^)^ 


/-111, lk^\ Nik 
y- ^ NM ) B 


c^Vk <l<^k 


Ik^ 
BVk 




/-111, k'^ \ Nik 
y^ NVk) B 


c^k < I 


Ik^ f I , 1 \ 
B \Vk^ Vm J 




/1 1 1 1 k'^ 1 fc2 \ 

I, c NVk nVm) B 



Table 3.3: a, 7 and their sum in the Z-th iteration for various values of Z, when c > 1 





a 


7 


a + 7 




k 


Nik 
B 


Nik 
B 


k — 


ik"^ 

B 


Nik 
B 


Nik 
B 


<l< 


fik^ 
BM 


Nik 
B 


/ 1 1 Ik'-^ \ Nik 
\ NM ) B 


(?y/k < I 


Ik^ 
BVk 


Nik 
B 


{-1 _i_ k^ \ Nik 
y^ NVk) B 



(N-t) 



when 1 < 



< c\ Eifr*^^' a + 7 = Ylti'^" Nlk/B = N{N - tf/Bk. 



(N-t)/k 



Table 3.4: The I/O complexity for Case 1 and c > 1, under various conditions 



Condition 



1/Os 



Condition paraphrased 



1 < < C2 



N{N-tf 
Bk 



k < 



M 



C2 < < c'^/k 



N{N-t)^ , k{N-t)^ 

Bk + BM Bk^ 



M2 



M 



(N-t) 



<k< 



_M2 

(N-t)'^ 



'Vk < 



(N-t) 



N{N-tf _|_ k{N-tf-M^ 



Bk 



BVk 



M^ 



< k 



Consider the following five statements: (The blank is to be filled in by the phrases 
hsted below to obtain the five respective statements.) There exist values M, N and t with 



52 



CHAPTER 3 



Banded Hessenberg Reductions and Related Problems 



t < vM, (which makes it possible to choose k — t with c = = > 1 satisfied) and 
is less than the cost of k = t. 



f- t > ^j^^^a and so that the cost of k — 



2- ^ > (j^^)2 and so that the cost of the best k in {j^, p^^j 

3- ^ > (N-t)^ that the cost of some k in ( (n-£)^ ■, t) 



^ <t< 7#7N7 and so that the cost oik- ^' 



5- 7^ <t < {N-t)'^ t^^t *^'-'^t some A; in ( i) 



We claim that each of these statements is false. Proofs by contradiction follow: 

Define a = logjv/t, h = log^(Ar - t). Then a, 6 > 0, i = M°, (AT - i) = and 
AT = + Thus, t < VM ^ a < 1/2. For Statements 1, 2 and 3, therefore, 

t > ^a>2-2b^b>l-a/2^b> 3/4. So = (M« + M'') ^ MK 

For Statement 1, when k = M/{N - 1) the cost is ^^^7*^' = NM^^-^/B. When A; = t. 





Bk 

the cost is < ^^^^^ + = {NM^^-^ + M2''+»/2)/S. But, ATM^^'-i < M^^'+^Z^ ^ 

46 - 1 < 2& + a/2 =s> (a/2 + 1) > 26 > 3/2 =S> a > 1. Contradiction. 

The cost for k in (]^, (T^^] is lower bounded by: ^^^^/f' which is minimised 

at A; = = M2-2^ For Statement 2, the cost at A; = M^-^^ is at most (M^''-^ _ 

M^^-^)/B ^ M^^-'^/B. When A; = t, the cost is, as before, (iVM^''-" + M^''+''/^) / B . But, 
^56-2 ^ j^2b+a/2 ^ - 1 < 2b + a/2 ^ {a/2 + 2) > 36 > 9/4 ^ a > 1/2. Contradiction. 

For Statement 3, the cost at t is {NM^^-" + M"/2+2'' - M2-°/2)/5. The cost at 
< < tis {NM^''-''+M''/^+^^-M^-''/^)/B. But, NM^\M-''-M-'')+M^''{M''/^- 
M^/2) - M2(M-"/2 - M-^/2) > ^ (2 - (a + c)/2 > 36 - a/2 - c) V (26 > 36 - a/2 - c) ^ 
6 < 3/4. Contradiction. 
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For statements 4 and 5, M^''' < < M^^^f* ^(l-6)<a<2-26^6> 1/2. That 
is, a < 1/2 < b. 

Consider Statement 4. When k = M/{N - 1) the cost is = NM^^-^/B. When 

k = t, the cost is < ^^^^^ + = (iVM^^-^ + M"+3^-i)/5. But, iVM^^-i < 

^a+3b_i ^ jY < TVf'^'. Contradiction. 

For Statement 5, the cost at t is {NM'^^-'' + Af^+^^^i - M^-^")/^. The cost at 
< < t < is (A^M^''-^ + M^+3''-i - M2-2c)/5. But, NM^^M-"" - M"^) + 

M^b-i(^^a _ _ M^^M-"^" - M~2c) >o^a + c>lV6<(2-c)/3. As c<a< 1/2, 
a + c> I cannot be. On the other hand, b < (2 - c)/3 ^ = M^^^^^ > M^^+^^Z^ ^ 

c > (1 + c)/3 ^ c > 1/2; contradiction. 

What we have shown is that when t < \/M, t is a better choice for k than any of the 
smaller values. An analogous proof shows that if \/M < t, then a/M is better than any 
smaller value. 

If c < 1, (that is, k > a/M) then Table simplifies as follows: 

Table 3.5: a, 7 and their sum in the l-th iteration for various values of /, when c < 1 





a 


7 


a + 7 


1 < / < c'^y/k 


BM 


Nik 
cB 


\c ^ NM J 


Nik 
B 


C^y/k <l <c^k 


Ik'"* 
BVk 


Nik 
cB 


(1+ ^ 

\c NVk/ 


\ Nik 
1 B 


c^k < I 


Ik'-* f 1 , 1 \ 
B \Vk ^ VTi) 


Nik 
cB 


A. 

\c NVk 


NVM J B 



Choosing c < 1 we would have v^M < k < t. As the simplified tables above show, any 
such choice would be worse than the choice oi k = a/M in the c > 1 case discussed above. 
That is, if a/M < t, then ^/M is better than any larger value oi k < t. 
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Table 3.6: The I/O complexity for Case 1 and c < 1, under various conditions 



Condition 



I/Os 



Condition paraphrased 



1 < < c^Tfc 



N{N-tf _|_ fc(jV-t)3 



k < 



(Af-t)2 



N{N-t)^ , k{N-tf-M^ 
bVm B\fk 



M 



(N- 



^ < k Sind (N -t) < M 



c'k<^-^ 



N{N-tf I k{N-t)^-M^ I k[N-tf-kM'^ 
BVM BVk By/M 



M2 



< k and {N -t) > M 



If A; is to be at most t, choosing it as mm{t, vM} is the best. 



3.2.2.2 Case 2: k > t 



For 1 <i < {N — t)/k and I < j < k/t, in the (i, j)-th iteration of the inner loop there are 



a QR decomposition of an {N — g — h — t) x t matrix, 



a matrix multiplication chain of dimensions {h, N — g — t, t, t), 

a matrix multiplication chain of dimensions {N — g — t,N — g — t,h + t,h + t,t), and 

a matrix multiplication chain of dimensions {N — g — t,h + t,h + t,N — g — t,t), 



where g — {i — l)k and h — {j — l)t. Let a, 7 and 5 denote the I/O complexities of 
these operations, in that order. For 1 < i < (N — t) / k, in the i-th iteration of the outer 
loop there is a 

• a matrix multiplication chain of dimensions {g + t, N — g — t, k, k, N — g — t), 

• a matrix multiplication chain of dimensions {N — g — t,N — g — t,k,k,N — g — k), 
and 

• a matrix multiplication chain of dimensions (A^ — g — t,k,k,N — g — t,N — g — k). 
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Let n, ip and (p denote the I/O complexities of these operations, in that order. 

Let c — s/M /t. Proceeding as in the analysis of case 1, we find that as i varies from 1 to 
{N — t)/k and j varies from 1 to k/t, {N — g — h — t) takes on values It for {N — t)/t > Z > 1, 
and therefore a in the l-th iteration is as given in the table below: (We omit the order 
notation for brevity.) 

As < g < N - t - k, we have that t < g + t < N - k, N - t > N - g - t > k srnd 
Table 3.7: a in the l-th iteration for various values of I, when k > t 





a 


l<l<f 


t 


f<l<c^ 


B 


c^<l< c^^Tt 


ft* 
BM 


c^^/i <l<cH 


B\/t 


cH < I 


B ^ VM J 



N-k>N-g-k>t. Therefore, 

Elf,-"" m o{{^ + i) E'fr"" iM^) 

e£-"" m>) +m+ Hi) < o ( + 1) E!f = o (^i^i + 1) ) 

Similarly, Elf r"'" Eji', = O (E.C'r"" ^ ( A + 0) ' 

Also, E.S-""= Eji,'i/5(M) + ^(^^) = o (e&""= Ef^'i '"-'-J'"-"" + 1)) 

As N — g — t > h + t, we have: 

{N-t)/k k/t (, . {N-t)/k k/t 2\ 

E 5:wj)+7(M)+*(u)<o (-^+i) ^ e ' r 
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-({^..)iT'^)-({^{A-))) 

The total cost of all matrix multiplications is: O ^Mi^ilL _|_ i _|_ ^jj. 
The total cost of QR-decomposition is: 



Table 3.8: The total cost of QR-decomposition, under various conditions 



Condition 


Cost of QR decomposition 


I < (N-t) ^ B 


N-t 


B_ < < 




^2 < (iV-t) < 


t{N-t)^ 
BM 


c^Vt < < cH 


tiN-tf-M'^ 
Byjt 


cH < (^7*) 


t(N-tf-M'^ . t{N-t)^-tM^ 
Bsfl bVM 



As can be seen, the I/O complexity, the sum of the above two, is independent of the 
choice of k. But, when t < \/M, a choice oi k — t, and when t > Vm, a choice oi k — \fM 
would give lower costs than these. 

Thus, we conclude that the slab based algorithm does the best when k is chosen as 
min{-/M,0- 

3.3 Reduction to Banded Hessenberg Form using the 
Tile Approach 

In this section, we discuss the reduction of a nonsymmetric N x N matrix A into banded 
Hessenberg form using the tile approach. Suppose t is the desired bandwidth. Without 
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loss of generality, we assume that is a multiple of t. Partition the matrix A into tiles of 
size t X t] there are N/t tile rows and N/t tile columns. 

A tile based parallel algorithm for message passing multicomputers for banded upper 
Hessenberg reduction was proposed in [17]. Recently, tile based algorithms have been 
designed in the multicore environment using both Householders and Givens rotations [STl 

We assume that A has already been "tile-transposed" . (We define tile transposition of 
A as the problem of permuting its elements so that each tile is available contiguously in 
column- major order.) As we showed in Chapter 2, if \/M > B (the tall cache assumption), 
for any t, a t X t tiling of A can be found in 0{N'^/B) I/Os and 0{N'^/{t + ^/M)) seeks. 

3.3.1 An Algorithm for t < \fM 

First we consider an algorithm derived from the parallel algorithms of [T7]; it is described 
in Algorithm 13.21 We use the narrow panel techniques of [27 t l6 ^ [90] : QR decompose narrow 
panels of size t x k, aggregate its k Householders for each panel, and apply the aggregates 
from the left and right one by one, by accessing narrow panels of rows and columns. 

Now, the functions invoked by the algorithm are described, and analysed for their I/O 
complexity. We consider three cases where one, two and three tiles fit in the main memory, 
respectively. 

The functions "QR-decompose" , "QR- Update", "QR-Muhiply-from-left-1" , and "QR- 
Multiply-from-left-2" are described in Chapter 2. 

Function QR-Multiply-from-right-l(Ajj, Aij) 

If Aji QR-decomposes into QjiRji, then a pre-multiplication of A by an orthogonal matrix 
obtained by replacing the j-th diagonal tile of Ijy by Qj^ has the effect of making 
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Algorithm 3.2. Banded Hessenberq reduction using the tile approach 

Input: A nonsymmetric matrix A G M^^^ and a parameter t < \fM 
Output: A handed Hessenberg matrix of bandwidth t. 

for i = 1 to N/t - 1 do 
for j = i + 1 to N/t do 
QR-decompose{Aji) ; 
for A; = i + 1 to N/t do 

QR-Multiply-from-left-l{Aji, Ajk) ; 
endfor; 

for / = 1 to N/t do 

QR-Multiply-from-right-l{Aji , Aij ) ; 
endfor; 
endfor; 

for j = N/t to i + 2 step -1 do 
QR-Update{Aj_ii, Aji); 
for A; = i + 1 to N/t do 

QR-Multiply-from-left-2{Aji, Aj^ik, Ajk); 
endfor; 

for / = 1 to N/t do 

QR-Multiply-from-right-2{Aji, Aij_i, Aij); 
endfor; 
endfor; 
endfor; 



Aji upper triangular. In order to complete the similarity transformation, A has to be 
postmultiplied by Q; that is, every tile Aij has to be post-multiplied by Qji. 

The invocation QR- Multiply- from-right-l(74jj, Aij) achieves this update of Aij as follows: 
Bring Aj^ (that is, the Y matrix) into the main memory, and update horizontal slabs of 
Aij of width k one at a time from right using the Y matrices, and the T matrices that 
remain in the memory after the invocation QR-decompose(74jj). How Aij is accessed will 
depend on the number of tiles that fit in the main memory. If only one tile fits, then the 
row slabs of Aij are read, updated and written one after the other. If two tiles fit, then 
Aij can be kept in the main memory. So, the total number of I/Os will be at most 2t^/B, 
irrespective of the number of tiles that fit in the main memory. 
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The number of seeks is 0{t/k) for the one tile case, and 0(1) for the two and three tile 



cases. 



Function QR-Multiply-from-right-2(y4jj, Aih, Aij) 

Here h < j. Suppose Q/yi ^ ^ is a QR decomposition of ^ ^ . Then, in order to 
complete the similarity transformation, Qhji has to be applied from the right to the h-th 
and j-th block columns of tiles. Let 

( Bih Bij ) = ( Aih Aij ) Qhji- 

Say, Q^jj and Rhi have been calculated by an invocation to QR-Update(A/jj, Ajj). At the 
end of the invocation the Y matrices are stored in Aji. 

Bih and Bij can be calculated as follows: 

Suppose the Y (that is, Aji) and T matrices are kept in the main memory. Repeatedly, 
read the next horizontal slabs from Aih and Aij, update them using the Y and T matrices, 
and write them back. Irrespective of the number of tiles that fit in the main memory, the 
number of I/Os is At^ / B\ Aij and Aih are read and written; Aji is read in only once for the 



entire column update. In every invocation in Algorithm 13.21 Aji would be upper triangular 
and therefore the number of additional I/Os per column update would be only OM'^/B. 

The number of seeks is 0{t/k) if one or two tiles fit in the main memory; 0(1) if three 
tiles fit in the main memory. 

I/O Complexity 

We maintain all tiles in row major order. Note that the multiplications from the left 
and right are not done the same way. This asymmetry allows us to dispense with all tile 
transpositions that would have been necessary otherwise. 
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The I/O complexity of the one tile implementation is: 



2 JV/t-1 



E 



N/t / N/t N/t \ N/t / N/t N/t 

J2 2.5+ J2 2.5 + 5^2 + J2 2.5+ ^-^ + ^2^ 

j = i+l \ k = i+l 1 = 1 J j=i + 2 \ k=i + l 1 = 1 



32N^ -61.5N^ + 14.5Nt - 15*^ 



6tB 



6B 



The I/O complexities of two and three tile implementations are: 



,2 

- T 



N/t 

E 

j=i+i 



N/t N/t 
2+ E 2 + E2 

k=i + l 1 = 1 



N/t 

- E 

i=i+2 



2.5- 



N/t N/t 

- E 4 + E4 

k=i + l 1 = 1 



tB 



^lON^ +3Nt - 2t^ 
B 



Note that here the desired bandwidth t decides how many tiles fit in the memory; the 
implementation to be chosen is the one that fits as many tiles as possible. 

If we were to proceed as in [61], then the right updates (with "QR-Multiply-from-right- 
1" and "QR-Multiply-from-right-2" ) in the one and two tile cases would involve reading the 
Y matrix afresh every time a tile is to be updated. In addition to that, after the completion 
of each update, the unreduced tiles would need to be transposed, costing adding extra I/Os. 
Henceforth all right updates in this thesis will be done in our way, and not the one of [6^ . 

In the single tile implementation, a single tile and three and a half narrow panels of size 
t X k occupy the main memory; i.e. t"^ + 7kt/2 < M. In the two tile implementation, two 
tiles together with two and a half narrow panels of size t x k occupy the main memory; 
i.e. 2t^ + 5kt/2 < M. In the three tile implementation, three tiles together with with a 
panel and a half of size t x k occupy the main memory; i.e. + 3kt/2 < M. The half 
panel will hold the T matrices and the full panel will provide space for computations. The 
parameter k is usually kept small; 1 < A; ^ t. 

We also require a number of I/Os to deal with the T matrices. Whether we use the one, 
two or three tiles approach, we have t/k number of T matrices of size k x k each. Since 
the T matrices are upper triangular, that amounts to roughly tk/2 elements. There are 
N/t — i number of single tile QR factorisations and N/t — i — 1 number of two tile QR 
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updates for i = 1, . . . , N/t — 1. So I/Os required to handle the T matrices is: 



N/t-l 

E 

i=l 



kt{N/t + kt{N/t _ /A^(iV - t)k 

2B ~ V tB 



Seek Complexity 



If only one or two tiles fit in the main memory, then the number of seeks needed is 



0( " ^"a ^' )- If three tiles fit, then it is 0{ — ^ 



-)• 



3.3.2 An 0(1) factor improvement in I/Os 

A small modification to the above algorithm can improve the number of I/Os by a constant 
factor. Some of the invocations "QR-decompose(74jj)" , 1 < i < N/t and i < j < N/t, 
and their subsequent left and right updates in Algorithm 13.21 can be avoided, as described 
in [811|83] for multicore architectures, if instead, we, for 1 < i < N/t, QR factorize the 
subdiagonal tile Aj+i^j and then annihilate the lower subdiagonal tiles by invoking "QR- 
Update(74j+i j, Aji)" , for j > i + here Aji is a full tile instead of an upper triangular tile as 
in Algorithm 13.21 The other steps of the algorithm remain unchanged. See Algorithm 13.31 

The functions invoked here are the same as those in Algorithm 13.21 



I/O Complexity 

The I/O complexity of the one tile implementation is: 



2 N/t-l 



t 



E 



N/t N/t N/t j N/t N/t 

2+ E 2.5 + X:2+ 3.5+ 5 + 

k-i + l 1 = 1 j = i+2 \ k = i+l 1=1 



The I/O complexity of the two tile implementation is: 



2 N/t-l 



E 



N/t N/t N/t 

2+ E 2 + E2+ E I 3-5- 

k = i + l 1 = 1 j = i+2 



N/t N/t 

■ E 4+E4 

k = i+l 1 = 1 



22Af3 -35Af2 + 5Nt + 9t^ 



20N^ -31.5Af2 + 2.5Aft + 9i2 
6tB ~^ 6B 
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Algorithm 3.3. Banded Hessenberg reduction using the tile approach, an 0(1) factor 
improvement 



Output: A banded Hessenberg matrix of bandwidth t 

for i = 1 to N/t - 1 do 
QR-decompose{Ai+ii) ; 
for j = i + 1 to N/t do 

QR-Multiply-from-left-l{Ai^ii, A^^i j); 
endfor; 

for A; = 1 to N/t do 

QR-Multiply-from-right-l{Ai^ii, Ak i+i) ; 
endfor; 

for j = i + 2 to N/t do 

QR-Update{Ai+ii, Aji); 
for A; = i + 1 to N/t do 

QR-Multiply-from-left-2{Aji, Ai+n,, Ajk); 
endfor; 

for / = 1 to N/t do 

QR-Multiply-from-right-2{Aji, Ai+i) 
endfor; 
endfor; 
endfor; 



The cost of the three tile case is asymptotically the same as that of the two tile case. 

QR-decomposition is invoked only once for each i — 1, . . . , N/t — 1. So the number of 
I/Os required to handle the T matrices is: 



Seek Complexity 

If only one or two tiles fit in the main memory, then the number of seeks needed is 



Input: A nonsymmetric matrix A e R^^^ and a parameter t 



N/t-l 




0{ 



N'^(N-t) 



). If three tiles fit, then it is *^ ). 
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3.3.3 When r > 3 tiles fit in the main memory 

So far, in the algorithms of this section, we assumed a presence of at most three tiles in the 
main memory. Particularly when t = o(-\/M), the main memory remain unused throughout 
the reduction limiting the performance. If more tiles would fit in the main memory, we 
could improve on memory utilisation and thus reduce the number of I/Os. 

We now present an algorithm that assumes that r > 3 tiles fit in the main memory; 
that is, rt"^ < M. See Algorithm 13.41 This is similar to Algorithm 13. 3[ except in that it QR 
updates r/2 — 1 (instead of one) subdiagonal tiles at a time, and multiplies r/2 (instead 
of two) tiles at a time. 



Algorithm 3.4. Banded Hessenberg reduction using r > 3 tiles in the main memory 
Input: A nonsymmetric matrix A G M^^^ and a parameter t 
Output: A banded Hessenberg matrix of bandwidth t 

s = r/2 - 1; 

for i = 1 to N/t - 1 do 

QR-decompose{AiJ^ii) ; 

for j = z + 1 to N/t do 

QR-Multiply-from- left- i ( Aj+i j , Aj+i j ) ; 

endfor; 

for /c = 1 to N/t do 

QR- Multiply- from-right-l{Ai^ii, Aki+i); 
endfor, 

for J = 1 to \{N/t -{i + l))/s] do 

Let xi, . . . ,Xs respectively be i + (j — l)s + 2, . . . , z + js + 1; 
QR-Update-n{Ai+ii, A^^i, . . .,A^^i); 
for = i + 1 to N/t do 

QR-Multiply-from-left-n{Ai+ik, A^^^fc, • • • , ^x,fc); 
endfor; 

for / = 1 to N/t do 

QR-Multiply-from-right-n{Aii+i, Ai^^, . . . , Ai^J ; 
endfor; 
endfor; 
endfor; 
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The functions "QR-decompose" , "QR-Multiply-from-left-1" , and "QR-Multiply-from- 
right-1" are the same as in Algorithm 13. 31 

Function QR-Update-n(v4i+i i, A^.^ j, . . . , A^^ i) 

The invocation of "QR-Update-n(Aj+i j, A^,^ j, . . . , A^^i)" annihilates the s = r/2 — 1 tiles 
Ax-i^i, . . . , Ax^i from the current block column i by taking the upper triangular tile Aj+i^j as 
pivot, overwrites the st x t sized Y matrix in the corresponding s annihilated tiles. The Y 
and the upper triangular txt sized T matrices are kept in memory. So the I/O complexity 
is 2st^ /B: read and write s tiles. Seeks: 0{s). 



Function QR-Multiply-from-left-n(Aj+i ^t, Ax-^ fc, . . . , A 



Xs k) 



This function updates the s + 1 tiles Ai_^.ik, A^^^, . . . , A^^^ from the left using the in-core 
Y and T matrices. So the I/O complexity is rt^ / B: read and write (s + 1) tiles. Seeks: 
0{s). 

QR-Multiply-from-right-n(A/i+i, A/^.^, . . . , ^^.J 



This function updates s tiles j+i, A/j,^, . . . , Ai^^ from right using the in-core Y and the 
T matrices. So the I/O complexity is rt^ / B\ read and write [s + 1) tiles. Seeks: 0{s). 

I/O Complexity 

The I/O complexity of the algorithm is: 



B 



JV/t-l 

E 



2s - 



N/t N/t 
fc = i + l i = l 



5r7V3 (-15r + 24s)Ar2 + (lOr - 24s) 



6stB 



6sB 



When r is large this is + o 



tB 
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"QR-Update-n(A+ii,^xii,...,^x,i)" is invoked [Mzii±ll] times for i = 1,..., N/t-l. 
So I/Os required to handle the T matrices is: 



As can be seen from the I/O complexity this algorithm does better than the other 
algorithms discussed in the previous sections. Also it can be observed that if rt^ < M, i.e., 
the entire block column fits in memory then the algorithm behaves like the slab approach 
algorithm of Section 13.2.11 

As expected this algorithm reduces the I/O cost by a factor of two over Algorithm 13. 3[ 
Seek Complexity 

The number of seeks is O ( j ^ as in the three tile implementations seen earlier. 



matrix A into wxw tiles; three wxw tiles can fit in the main memory. Then apply a mod- 
ified version of Algorithm 13.31 (see Algorithm 13. 5p . In Algorithm 13.51 the QR-decomposition 
of the i-th block column starts from the tile Aj+^j instead of Aj+i j. Also the block columns 
reduced are 1 to (A^ — t)/w, instead of 1 to N/t — 1. 

The functions invoked here are the same as those in Algorithm 13.31 
I/O Complexity 

Assume that three tiles fit in the main memory. 
The I/O complexity of the algorithm is: 



N/t-l 




3.3.4 An algorithm for t > VM 



Without loss of generality, assume that t is a multiple of w 



a/M/S- Divide the input 
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Algorithm 3.5. Banded Hessenberg reduction fort > y M 



Input: A nonsymmetric matrix A E M^^^ and a parameter t > vM 
Output: A banded Hessenberg matrix of bandwidth t 

Let w = a/ M /3 and s = t/w; 
for i = 1 to — — s do 

w 

QR-decompose{AiJ^si) ! 
for j = z + 1 to N/ w do 

QR- Multiply- from- left- l{Ai+s i , ^i+s j ) / 
endfor; 

for A; = 1 to N/w do 

QR-Multiply-from-right-l{Ai+si, Ak i+s); 
endfor; 

for j = s + i + 1 to N/w do 
QR-Update{Ai+si, Aji); 
for k = i -\- 1 to N/w do 

QR-Multiply-from-left-2{Aji, Ai+^k, Ajk); 
endfor; 

for / = 1 to N/w do 

QR-Multiply-from-right-2{Aj i, A^+s, Aij); 
endfor; 
endfor; 
endfor; 



_ {20N^ - UN'^t + mt^ + 12*3) _i9.5Ar2 - Nt - S.St^ - 0.5(Af - t)w 
~ &wB ' 6B 

Note that when t = w, Algorithms 13.31 and 13.51 have the same I/O complexity. 
The number of I/Os needed to deal with the T matrices is 



N-t 



i=l i=l j=i+l+t/w ^ 



B 



Seek Complexity 



The number of seeks is: O 



ATS 



MVM 
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3.4 Reduction of a Symmetric Matrix to Symmetric 
Banded Form using the Slab Approach 



Reduction of a symmetric matrix A G R^^^ to symmetric banded form with a semi band 
width t < A^, is usually the first step of tridiagonal reduction [201422] . This is usually done 
by finding an orthogonal transformation Q such that Q^AQ is in the desired form. Due to 
symmetry only the upper (or lower) banded portion needs to be stored. Symmetric band 
reduction has received a lot of attention [T0 | [TT | |55 | [60 |I119] . 



3.4.1 Reduction with a slab width of t 



It is typical to reduce the matrix is by assuming that the algorithmic slab size k is same 
as the desired bandwidth [TTl|T9l[55[|60] . Algorithms with this assumption have been im- 
plemented on parallel machines [HlIlH]- An out-of-core algorithm with this assumption in 
the context of symmetric generalised eigenvalue problem is given in [60] . 

In this section we discuss the out-of-core algorithm of [60] which proceeds as follows. 
(See Algorithm 13.61 ) 

Let the input symmetric matrix A G R^^^ be partitioned into tiles of size t x t, and 
let X be the first slab (block column) of A excluding the diagonal tile. 



/ An Al, ... Al^ \ 



A 



^21 

A2I A22 



A^ 





( A21 


\ 




A31 




X = 






\ Am 


I 
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QR factorise X and aggregate the t Householder transformations into a matrix Q in WY 
representation [23] . 

^ ~ I / + WY^ 
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Apply Q on A from both sides. This reduces the first block column and block row of A, 
and modifies all other tiles of A: 



Q^AQ 



Let A^"^^ and A^'^^ be the intersections of block rows and columns numbered from 2 to N/t, 
respectively, before and after the transformation. Then 



/ An 


R21 


■ 





R21 


A22 


^32 







A32 


A33 ■ 




I 


An n 
T ^ 


■ 


Ann 

t t 



i(2) = (J + W^)^A(2)(/ + 



Expanding, 



(3.1) 



(3.2) 



(3.3) 



Letting S = A^^^W and = W^A^^^W and substituting for S and V", 
That is, 

i(2) = ^(2) + + {1/2)VY^) + {S+ (l/2)r\/)F^. 

Let T = 5 + (l/2)r\/, and we have: 

Recurse with A^'^\ 

The asymptotic I/O complexity of the algorithm is same as that of the banded Hessen- 
berg reduction described in Section [3.2.1} use k = t. Due to symmetry some computations 
can be avoided. (Note that the use of the YTY^ representation in one the WY^ rep- 
resentation in the other for Householder aggregates do not cause a difference in the I/O 
complexities.) 
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Algorithm 3.6. Reduction to symmetric banded form JW^ 



Input: An N x N symmetric matrix A and a parameter t. 

Output: A symmetric banded matrix of bandwidth t, orthogonally similar to A. 

for k = l, N/t-1 do 
for j = k + 1, N/t do 

Read in Ajk and store in Xj; 
endfor; 

Compute the QR factorisation of X accumulating the matrices W, Y, and R; 

Write R over block A^+ik,' 

Write W and Y to secondary storage; 

Compute S = A^'^^^^W with S overwriting W ; /* Can't overwrite W , needed next line */ 
Compute V = W^S with V overwriting R; 
Compute T = S — 1/2YV with T overwriting W; 
for j = k + 1, N/t do 
for i = j, N/t do 
Read in Aij; 

A.. — A.. — VT^ — TY^ ■ 
Write out Aij; 
endfor; 
endfor; 
endfor; 



3.4.2 Reduction with a slab width not necessarily equal to t 

Algorithm I3.6[ chooses the slab width k = t. The parallel algorithms of p!T |lll9] gener- 
alise by assuming that k could be different from t. The algorithm of jll9] is similar to 



Algorithm 13.11 for banded Hessenberg reduction. The analysis is essentially the same. We 
observed in our discussions on the banded Hessenberg reduction that k should not exceed 
t. It is applicable here too. The same observation is made in [TP] . 
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3.5 Reduction to Symmetric Banded Form using the 
Tile Approach 

Suppose the input symmetric matrix A G R^^^ has been partitioned into tiles of size txt. 
Without loss of generality, assume that is an integer multiple of t. The algorithms of 
Section [3.3.21 work as they are. 

Some improvements in I/O are possible though. When A is to be transformed into 
orthogonally similar Q^AQ, and , from left, applies to a set S of rows, then Q from the 
right applies to the corresponding set of columns. Thus, only the elements which are on 
the intersection of the concerned rows and columns need be updated from the right. So 
some of the right updates in those algorithms can avoided, thus saving I/O as well as flops. 

The algorithms when analysed keeping the above in mind, expectedly, are found to 
require about one half the number of I/Os and seeks as in the banded Hessenberg reduction. 
We omit the details. 

3.6 Reduction to Upper Banded Form using the Slab 
Approach 

The reduction of a general matrix to banded form using orthogonal equivalence transfor- 
mations is discussed in [611 [821 [83]. Given a matrix A G R^^^, N > P, and a desired 
bandwidth t, our aim is to find orthogonal transformations U G M^^^ and V G R^^^ 
such that Bf = U'^AV is in upper banded form of bandwidth t; i.e., Bt{i,j) = whenever 
\i — j\ > t. If N < P, we reduce the matrix to a lower banded form. It is easy to generalise 
these algorithms to get reduction into a matrix with m upper and t lower diagonals; we will 
not discuss this further. Also, we recall only the reduction to upper banded form {N > P); 
the reduction to lower banded form (A^ < P) is analogous. 
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In this section, we discuss Householder based upper banded reduction (A^ > P) using 
the slab approach. Two factorisation methods, QR and LQ, are interleaved from the left 
and right respectively to reduce the matrix to banded form. A slab based parallel algorithm 
for the problem was given in |6T]; an external memory implementation of that is discussed 
here. See Algorithm 13.71 for a description. 

In the algorithm the notations As and Al stand for the submatrices which have to 
be QR and LQ factorised respectively in the s-th iteration. Similarly the notations A^J^^^ 
and A^^^"* stand for the submatrices that have to be updated by orthogonal matrices U 
and V from the left and right respectively in the s-th iteration. The computation proceeds 
as follow: 

Let the remaining unreduced portion A oi A after (s — 1) iterations be represented as 
^^), where Af^'^ e Ri^-i^-m>^t ^(RU) ^ j^[N-{s-i)t]x{p-st) submatrices. 

( R P\ 

Then QR factorise Ai into QsRs using the slab based algorithm, and aggregate the t 
Householders into a matrix Qg using, say, the I + YTY^ representation [23]. Apply 
from the left: 

Now the s-th block column is reduced. We next reduce the first block row of ^i^^) to 
banded form using LQ factorisation. Again partition the updated A as follows: 



P-st 



A 



R' Ai''^ 



t 

N-st 



Ai^^) 

Now LQ factorise A^^^'^ = LgPj . Apply Pg from right 



APg 



t P-st 
R's 

Ai^^)p. 



t 

N-st 



Now the s-th block row is reduced to upper banded form. Continue this another P/t — l — s 
times to reduce the matrix to upper bidiagonal form. 
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Algorithm 3.7. Reduction to banded form 161] 

Input: An N x P (N > P) matrix and a parameter t. 

Output: An orthogonally equivalent upper handed matrix of bandwidth t. 

i = 0; 

while i < P do 

S = i/t + 1; 

Define Ai^^^ =A[i + l:N; i + l:i + t]; 
Define Af^^ =A[i + l:N; i + t + l:P]; 
Ai^^^ = QsRs; /*QR decomposition */ 
Ai''''^ ^Q^Ai''''^ /*update*/ 
if i + t < p do 

Define aI^^^ = A[i + I : i + 1; i + t + 1: P]; 

Define Ai^^^ = A[i + t + l: N; i + t + 1: P]; 

Ai^^^ = LgPj ; /*LQ decomposition */ 

Af"^^ 4- Af^^P, ; /*update*/ 
endif; 
i = i + t; 
endwhile; 



The LQ factorisations are analogous to the QR factorisations, except that the House- 
holder vectors are stored in row major order in the zeroed upper triangular portion of 
the matrix. The left updates are analogous to the ones in QR factorisation. (See Chap- 
ter 2.) The right updates are similar to the ones in banded Hessenberg reduction but use 
orthogonal transformations different from the left side. Therefore the I/O complexity of 
the reduction is dominated by the I/O complexity of QR decomposition. The details of 
the analysis are hence omitted. 

3.7 Reduction to Upper Banded Form using the Tile 
Approach 

In this section, we discuss band reduction using the tile approach. Parallel algorithms using 
the tile approach were proposed in [T7l|6T] . Parallel reductions using the tile approach have 
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been proposed for multicore architectures too [821183] . 
3.7.1 An algorithm for t < \fM 

Suppose the N x P, N > P matrix has been partitioned into tiles of size t x t. Without 
loss of generality, assume that and P are integer multiples of t and denote, X = N/t 
and Y = P/t, that is the matrix is partitioned into X xY tiles. The reduction is described 
in Algorithm 13.81 



Algorithm 3.8. Reduction to banded form I8^l8^ 

Input: An N x P (N > P) matrix and a parameter t. 

Output: An orthogonally equivalent upper handed matrix with bandwidth t. 

Let X = N/t and Y = P/t; 
for i = 1 to Y do 

QR-decompose(Aii ); /* QR factorisation*/ 
for j = i + 1 to F do 

QR- Multiply- from- left- l{Aii, Aij ) ; 
endfor; 

for k = i -\- 1 to X do 

QR-Update{Aii, Aki) ; 

for j = i -\- 1 to F do 

QR- Multiply- from-left-2{Aki, Aij, A^j); 

endfor; 
endfor; 
ii i <Y then 

LQ-decompose(Aii^i ); /* LQ factorisation* / 
for j = i -\- 1 to X do 

LQ- Multiply- from-right-l{Ai j+i , Aj j+i) ; 
endfor; 

for k = i + 2 to Y do 
L Q- Update{Ai i+i,Aik); 
for j = i -\- 1 to X do 

LQ- Multiply- from-right-2{Aik, Aji+i, Aj^); 
endfor; 
endfor; 
endif; 
endfor; 
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The functions "QR-decompose" , "QR-Update", "QR-Multiply-from-left-1", and "QR- 
Multiply-from-left-2" invoked here in Algorithm 13.81 are the same as those in Algorithm l3.3l 

Function LQ-decompose(Aj j+i) 

Assume that the tile Aa^i is divided into horizontal slabs of width k each. Let r = 
t/k. The invocation LQ-decompose(Aj j+i) finds matrices Yi, . . . ,Yr, Ti, . . . ,Tr such that 
Ljj+i = Aii^i{It + YiT^Y^) ...(/( + YrT^Yj^) is lower triangular. This is done as follows: 

Read Aa+i into the main memory. Proceed in r iterations. At the beginning of the s-th 
iteration, the superdiagonal elements of the first {s — l)k rows of Aa^i are all zero. In the s- 
th iteration, calculate Yg and Ts such that the superdiagonal elements of Anj^i^It+YsTj'Y^) 
in the first sk rows are all zero. Update Aa^i to Aa^i^It + YgT^^Y^). Store Yg in the 
superdiagonal part of the s-th slab of Aa+i. 

When all slabs are processed, arrange the lower triangular part of Aa^i in column major 
order, and write Aa^i back to the disk. 

The number of I/Os needed is 2t^ / B] Aa^i is read and written; the T matrices are kept 
in the main memory. 

The number of seeks is 0(1) even for the one tile case. 
Function LQ-Multiply-from-right-l(y4j j+i, Aji+i) 

If Aii^i LQ-decomposes into Ljj+iQjj+i, then a post-multiplication of A by an orthogonal 
matrix Q obtained by replacing the {i + l)-st diagonal tile of In by Qa+i has the effect of 
making Aa^i lower triangular. In order to complete the orthogonal equivalence transfor- 
mation, A has to be postmultiplied by Q; that is, every tile Aj i+i has to be post-multiplied 
by Qii+i. 
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The invocation LQ- Multiply-from- right- 1 (^41 j+i, 74jj+i) achieves this update of Aji+i as 
follows: Bring Aa+i (that is, the Y matrix) into the main memory, and update horizontal 
slabs of Aji_^.l of width k one at a time from right using the Y matrices, and the T 
matrices that remain in the memory after the invocation LQ-decompose(74j j+i). How 
Aji^i is accessed will depend on the number of tiles that fit in the main memory. If only 
one tile fits, then the row slabs of ^jj+i are read, updated and written one after the other. 
If two tiles fit, then Aji+i can be kept in the main memory. So, the total number of I/Os 
will be at most 2t^/B, irrespective of the number of tiles that fit in the main memory. 

The number of seeks is 0{t/k) for the one tile case, and for the two and three tile 
cases. 

Function LQ-Update(74i/i, Afc) 



Here Aih is lower triangular. The invocation LQ-Update(Aj/i, Aik)), finds a matrix Q such 
that every element of the (i, k)-th tile of AQ^ is zero. This is done over r — t/k iterations. 
The s-th iteration processes A^^^^ the s-th leftmost vertical slab of width k ol Ai^. Consider 

the matrix C., 
h 



Let ^ y J be matrices such that C^. ylk+t + 

in the rightmost t columns of its s-th horizontal slab. 



h 



Tj( h )^ has zeroes 



Then, ( Aih A^k ) . 



\ 



-^s \ ^{s-l)k 



Ik o: 



t—sk 



has zeroes in 



4 

\ \ ys J J 

the upper triangular part of its s-th horizontal slab. Set ( Ai^Aik ) to this matrix by 
updating A^^ and A^k- The s-th slab A^k is now all zero. Therefore Yg can be stored there. 



Let Qjf. be the product 



( Ot_; 

h I T'[ ( Ot_fe h Y^ ) 

V ^1 



( h 

Qt-k \Tj{h Qt-k Y^) 

\ Yr 
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The above can be implemented as follows: Read Aik- Read Aih one vertical slab at 
a time. For each slab, update the slab, write it back and overwrite the corresponding 
horizontal slab of A^k with the Y matrix. The number of I/Os is at most 3t^/ B, irrespective 
of whether one tile or two tiles fit in the main memory: one tile and a half are both read 
and written. 

The number of seeks is 0{t/k) for the one tile case, and 0(1) for the two and three tile 
cases. 

Function LQ-Multiply-from-right-2(y4jfc, Aji^i, Ajk) 

Here i + 1 < k. Suppose ( La^i ) -Qa+ik is an LQ decomposition of ( Ajj+i Aik ) • 
Then, in order to complete the orthogonal equivalence transformations, Qa+ik has to be 
applied from the right to the (z + l)-st and fc-th block columns of tiles. Let 

( Bjij^i Bjk ) = ( ^ji+i Ajk ) Qji+ik- 

Say, Qii+ik and Lii+i have been calculated by an invocation to LQ-Update(y4jj_|_i, Aj^). 
At the end of the invocation the Y matrices are stored in Aik. 

Bji+i and Bjk can be calculated as follows: 

Suppose the Y (that is, Aik) and T matrices are kept in the main memory. Repeatedly, 
read the next horizontal slabs from Ajj+i and Aj^, update them using the Y and T matrices, 
and write them back. Irrespective of the number of tiles that fit in the main memory, the 
number of I/Os is At'^/B: Aji^i and Ajk are read and written; Aik is read in only once for 
the entire column update. In every invocation in Algorithm 13.81 the number of additional 
I/Os per column update would be only 0.5t^/B: to read Aik. 

The number of seeks is 0{t/k) if one or two tiles fit in the main memory; 0(1) if three 
tiles fit in the main memory. 
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I/O Complexity 

The I/O complexity for the one tile implementation is: 



p/t j p/t N/t I P/t \ \ P/t-1 / N/t P/t I N/t 

^ 2+^2.5+^ 3+E5 +E 2+^2+^ 3+^4 

i=l \ fc = i + l fc = i+l \ j = i + l / / ! = 1 \ ! = i+l fc=i + 2 \ j = 



2tB 2iP 



tB 



The I/O complexities for the two and three tile implementations are: 
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ANP'^ 


4P3 




tB 





The number of I/Os to deal with the T matrices is asymptotically the same as that of 
the Algorithm 13.31 

Seek Complexity 

If one tile or two tiles are available then number of seeks is O ^"^H^- j • But if three tiles are 
available then it is O (^^^^^ ■ 

3.7.2 When r > 3 tiles fit in the main memory 

Here we propose an algorithm that keeps r tiles in the main memory at a time, where 
rt"^ < M. This is similar to Algorithm 13.41 and is described in Algorithm 13.91 

The functions "QR-decompose","QR-Multiply-from-left-l", and "QR-Multiply-from- 
left-n" invoked here are the same as in Algorithm 13. 4[ Similarly the functions "LQ- 
decompose" and "LQ-Multiply-from-right-1" are the same as in Algorithm 13.81 



78 



CHAPTER 3 



Banded Hessenberg Reductions and Related Problems 



Algorithm 3.9. Reduction to banded form when r > 3 tiles fit in the main memory 

Input: An N x P (N > P) matrix A and a parameter t < y/M. 
Output: An upper banded matrix orthogonally equivalent to A. 

Let s = r/2- 1; 

Let X = N/t and Y = P/t; 

for i — 1 to Y do 

QR-decompose{Ai i); 

for j = i + 1 to F do 

QR-Multiply-from-left-l{Ai j , A^j); 

endfor; 

for A; = 1 to \{X-i)/s\ do 

Let respectively he i + {k — l)s + 1, . . . , i + ks; 

QR-Update-n{Aii, A^^i, A^^i); 

for j = i + 1 to F do 

QR- Multiply- from-left-n{Aij, A^^j, . . . , A^^j); 

endfor; 
endfor; 
if i <Y then 

LQ-decompose(Aii+i ); /* LQ factorisation*/ 

for j = i + 1 to X do 

L Q- Multiply- from-right-l{Ai j+i , Aj j+i ) ; 

endfor; 

for A; = l to \{Y -i-l)/s'\ do 

Let yi, . . . ,ys respectively he i + {k — l)s + 2, . . . , i + /cs + 1; 
L Q- Update-n{Ai i+i, Aiy^, . . . , AiyJ; 
for j — to X do 

L Q- Multiply- from-right-n{A j i+i, Ajy^, . . . , AjyJ; 
endfor; 
endfor; 
endif; 
endfor; 



Function LQ-Update-n(/lii+i, Aiy^, AiyJ 

The invocation of "LQ-Update-n(74jj_|_i, A^y^, . . . , Aj/J" annihilates the s = r/2 — 1 super 
diagonal tiles Aiy^, . . . , Aiy^ from the current block row i by taking the lower triangular 
tile Aii+i as pivot, overwrites the st xt sized Y matrix in the corresponding s annihilated 
tiles. The Y and the upper triangular t x t sized T matrices are kept in memory. So the 
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I/O complexity is 2st^ / B: read and write s tiles. Seeks: 0{s). 
LQ-Multiply-from-right-n(Aj i+i, Ajy^, . . . , Ajy^) 

This function updates s tiles (A^- j+i, Ajy^, . . . , ^jys) from right using the in-core Y and the 
T matrices. So the I/O complexity is rt'^/B: read and write (s + 1) tiles. Seeks: 0{s). 

I/O Complexity 

The I/O complexity of the Algorithm 13.91 is: 
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The I/Os to deal with the T matrices is analogous to Algorithm 13.41 and is 0{NP/rB). 
Seek Complexity 

Since all the Y, T and the tiles to be updates are brought into memory by full tiles instead 
of narrow panels, so the number of seeks is O ( 

3.7.3 Reduction to upper banded form for tile size t > \/ M 



The techniques used in Section [3.3.41 also can be employed here to improve the I/O. See 
Algorithm 13.101 for details. All the functions invoked here are the same as those in Algo- 
rithm 13.81 

I/O Complexity 

The I/O complexity of the algorithm is: 
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Algorithm 3.10. Reduction to banded form for t > y/M 

Input: An N x P (N > P) matrix A and a parameter t > \fM. 
Output: An upper banded matrix orthogonally equivalent to A. 

Let w = \J M /3 and s — t/w; 
for i = 1 to — do 

w 

QR-decompose{Aii) ; 
for I — i + 1 to P/w do 

QR- Multiply- from-left-l{Ai j , An); 
endfor; 

for j = i + 1 to N/w do 

QR-Update{Aii, Aji); 

for k — i + 1 to P/w do 

QR-Multiply-from-left-2{Aj Ai^, Ajk); 

endfor; 
endfor; 

if i < P/w — s do 
LQ-decompose{AiiJ^s) ; 
for j — i-\-l to N/w do 

L Q-Multiply-from-right-l{Ai j+s , Aj j+s) ; 
endfor; 

for k = i + s + 1 to P/w do 
LQ-update{Aii+s, Aik); 
for I — i + 1 to N/w do 

LQ- Multiply- from-right-2{Aik, Au+g, Aik); 
endfor; 
endfor; 
endif; 
endfor; 



The I/Os to deal with the T matrices is as follows: 

2 £ / N/w \ 2 (P-i)/"^' ( P/^ 

+^ E h+ E ^\=o(NPiB) 

i=l \ l=i+l I i=l \ k=i+l+t/w 



Seek Complexity 

The number of seeks is: O ( ) . 
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Reduction from Banded Hessenberg 
Form to Hessenberg Form and 
Related Problems 

4.1 Introduction 

We briefly recapitulate from the Introduction to Chapter 3: The eigenvalues and singular 
values are typically computed in a two stage process. In the first stage the input matrix is 
reduced, through a sequence of OSTs or OETs to a condensed form and in the second stage 
an iterative method called the QR algorithm is applied to the condensed form. All known 
algorithms for the first stage need 0{N^/B) I/Os and 0{N^) flops. Therefore, the first 
stage is usually the performance bottleneck. It has been proposed that the reduction in the 
first stage be split into two steps p nil9ll21ll77ll80] . the first reducing the matrix to a banded 
Hessenberg, symmetric banded or nonsymmetric banded form, as is relevant pT]l6ip81] , and 
the second further reducing it to Hessenberg, tridiagonal or bidiagonal form [2n[77t [78 l [80] . 

In Chapter 3 we discussed the first step of the first stage. In this chapter we study the 
second step. 

Unblocked Hessenberg reductions were known even in 1960s ^85j. Blocked algorithms 
that try to incorporate M-M operations have been invented since then [ISl - HTllSQ] . As we 
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shall show, these algorithms run in 0{N^/B) I/Os on full matrices. If the input matrix 
is in banded Hessenberg form, these algorithms are not suitable, as they don't exploit the 
sparsity in the input. After a few steps of these algorithms, the matrix becomes full with 
the fill-ins created, resuhing 0{N^) flops [2Tl[77l[78] and 0{N^/B) I/Os. To the best of 
our knowledge, no efficient algorithm has been proposed to reduce a banded Hessenberg 
matrix to Hessenberg form. 

For the tridiagonal reduction, if the input matrix is full, the conventional Householder 
or rotation based unblocked algorithms are useful [71 I321I581I115] . Blocked algorithms are 
also known [45j. Parallel algorithms are described in [461166]. These and their straight 
forward adaptations all run in 0{N^) flops and 0{N^/B) I/Os. Again, if the matrix is in 
symmetric banded form, a few steps of these algorithms will fill the matrix with bulges. 
Sequential and parallel algorithms for the tridiagonalisation of symmetric banded matrices 
have been known [20H22l[55l[72l[771[95l[971 . 

For the bidiagonal reduction, if the input matrix A G R^^^ is full, the conventional 
algorithms use Householders to reduce columns and rows, one by one [58 11115] . A House- 
holder based blocked algorithm that performs about half of its work in M-M is described 
in [35]. A bidiagonalisation using one-sided orthogonal transformations was proposed 
in [92]; its stability was later improved in [T^l2^ . These algorithms have an I/O com- 
plexity of 0(A^Pmin(A^, P)/B). Again, if the input matrix is banded, a few steps of these 
algorithms will fill the matrix with bulges, and so these are not suitable for such inputs. 
Sequential and parallel algorithms for reducing banded matrices to bidiagonal form were 
proposed in [78] . 

We now present an overview of the results in this chapter. 

We recall an unblocked and a blocked direct Hessenberg reduction algorithm, and show 
that they both have an I/O complexity of 0{N^/B). For the BH-H reduction we propose 
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two algorithms. The first one requires 0{N^/B) I/Os, but the second one, an extension 
of the first using tight packing of bulges requires only O (^^^ + I/Os when t, the 
bandwidth, is at most \/M. Combining this with the results of Chapter 3, we show that 
the Hessenberg reduction can be performed in 0{N^/\fMB) I/Os, when is sufficiently 
large. This matches the known lower bound on data movement for the problem [13] . 

We recall unblocked and blocked tridiagonal reductions, and show that they have an I/O 
complexity of 0{N^/B). For the symmetric banded form to tridiagonal form reduction, 
we show, the known algorithms take 0{N'^t/B) I/Os. Combining this with the results of 
Chapter 3, we show that the tridiagonal reduction can be performed in 0{N^/\fMB) I/Os, 
when is sufficiently large. This matches the known lower bound on the data movement 
for the problem 



We recall unblocked and a blocked bidiagonal reductions, and show that they have an 
I/O complexity of 0{N P min{N , P} / B), when applied on an x P matrix. For the 
banded to bidiagonal reduction, we show, the known algorithms take 0{mi'n.{N ^ P}H) / B) 
I/Os, when t is the bandwidth. Combining this with the results of Chapter 3, we show 
that the bidiagonal reduction can be performed in O ^ "^'"W^^-^^ j I/Os when the matrix 
is sufficiently large. This matches the known lower bound on the data movent for the 
problem |13] . 



4.1.1 Organisation of this Chapter 

In Section 2, we study some of the direct Hessenberg/tridiagonal/bidiagonal reductions, 
and their I/O performances. In Section 3, we present our new BH-H reductions. Tridiago- 
nal reductions are discussed in Section 4, and Section 5 is a study of bidiagonal reductions. 
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4.2 Preliminaries 

The reductions of matrices to Hessenberg or tridiagonal form (the latter if it is symmet- 
ric, the former otherwise) or bidiagonal form have traditionally been performed without 
constructing an intermediate banded form. We study such direct reductions in this section. 

4.2.1 Direct Hessenberg reduction 

For the reduction of a nonsymmetric matrix to Hessenberg form, unblocked and blocked 
algorithms have been known. 

4.2.1.1 Unblocked direct Hessenberg reduction 

Let A G R^^^ be a nonsymmetric matrix. Find Householder transformations Qi, Q2, ■ ■ ■ , 
Qn-2 such that each Qi is designed to introduce zeroes below the first subdiagonal in the 
i-th column of the matrix Qj_^ ■ ■ ■ Q^AQi ■ ■ ■ Qi_i [32lH5lli6l[58lfTT5] . 
Then H = Q%^2 ' ' ' Qi^Qi ' ' ' Qn-2 is a Hessenberg matrix. 

For a Householder Q = (I—uu^), where || u ||2= Q^AQ = {I —uu^y A{I —uu^) = 

A — uu^A — Auu^ + uu^Auu^ = A — uv'^ — {Au — uv^u)u^ = A — uv'^ — wu'^, where 
= u^A and w = Au — uv'^u. Hence Algorithm 14.11 

Algorithm 4.1. Unblocked direct Hessenberg reduction 
Input: An N x N matrix A 

Output: A is overwritten with a Hessenberg matrix similar to A 

for A; = I, - - ,A^-2 do 

Find a Householder vector u, ||^i||2= to zero the elements 

below the subdiagonal in the k-th column of A; 

v'^ = u^A; 

w = Au — {v'^u)u; 

update A = A — uv^ — wu^ ; 
endfor 
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As shown in p9], the total flops required by the above algorithm is Ylk=i 4A^(A^ — k) + 
EfcLi 4(A^ - kf ^ lOA^VS. The computations of u'^A and Au - {v^u)u require A in 
column major order and row major order respectively. Therefore, if > 5, the algorithm, 
as it is presented, needs 0{N^) I/Os. However, a matrix transposition in each iteration 
can bring the cost down to 0{N^/B) I/Os. 

To access each column/row if the matrix is in row/column major order takes seeks. 
So the seek complexity of the algorithm is 0{N^) if the algorithm is implemented as it is 
presented. But with a matrix transposition in each iteration, the seek complexity would 
be 0{N^/y/M + N^), which is 0{N^/y/M) as N > Vm. 

The main drawback of this algorithm is that all its operations are V-M or V-V. Blocking 
has been shown to improve on this I/O performance [32t H5 | H6]. 

4.2.1.2 Blocked direct Hessenberg reduction 

A blocked version of the above algorithm has been suggested [32lll5lll6] using aggregated 
Householder transformations [23l|96]. We describe and analyse this algorithm now. 

Let A be the input matrix, and let Ai = A. Find a Householder vector ui with 
||wi||2= l/"\/2 to reduce the first column of Ai. Let Vi = Aju and Wi = AiUi — UiU^AiUi. 

Initialise U = [ui], V = [vi], and W = [wi]. 

Then A2 = Ql^AiQu, =A^- Uivj - WiuJ = Ai - UV^ - WU^ . 

The second column of A2 can be computed like this: 02 = Ai[*,2] — f/y"^[*,2] — 
iyt/"^[*,2]. Find Householder vector U2 of euclidian norm l/\/2 for reducing 02. Let V2 = 
Alu2 = {Ai-UV'^ -WU^)^U2 and W2 = A2U2-U2V2U2 = {Ai-UV^ -WU^)u2-U2V^U2. 
Then A3 = Ql^A2Qu-, = A2 - U2V^ - W2ul = Ai - UV^ - WU^ - U2vl - W2ul = ^1 - 
( f/ «2 ) ^ - ( W2 ) ^ . Set f/ = ( f/ U2),V =[V t;2 ) and = 

[W W2). 
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That is, A3 can be computed without exphcitly computing A2; 02 is enough. If we 
continue hke this for k columns, the aggregates U, V and W can be used to update the 
rest of the matrix as in: 

Ak = Ai- UV^ - WU'^ (4.1) 

We would have computed Ak without explicitly computing A2, . . . , Ak-i. See Algorithm 14. 21 
below: 



Algorithm 4.2. Slab based direct Hessenberg reduction fj^ 
Input: Matrix A of size N x N. 

Output: A is overwritten with a similar Hessenberg matrix. 
Let A = A 

for i = 1 to N/k do 

Find a Householder vector u with ||m||2= 1/"\/2 to reduce 

A[{i -l)k + 2:N; {i - l)k + 1] to form (*, 0, . . . , 0)^; 
Compute V = A^u and w = Au — u{v'^u); 
Initialise U = [u], V = [v], W = [w]; 
for J = 2, ■ ■ ■ , A; do 

Let a, = A[*,j] - UV^[*,j] - WU^[*,j]; 

Find a Householder vector u of size N with ||u ||2= l/"\/2 to reduce 

aj[{i -l)k + j + l:N] to form (*, 0, . . . , 0)^; 
Compute V = A^u — VU^u — UW'^u; 
Compute w = Au — UV^u — WU^u — uv'^u; 
U = \Uu]; V = [V v]; W = [W, w]; 
endfor; 

for j = 1, . . . k do 

A[*; {t-l)k + j] = a,; 
endfor; 

ik + 1: N] = A[*; ik + 1 : N] - UV^[*; ik + 1 : N] - WU^[*; ik + 1 : N]; 
endfor; 



As shown in [89], the number of flops computed with matrix vector products is approx- 
imately, 2 Xlfc^i'^ A^(A^ — k) ^ N'^, while the total number of flops remains nearly the same 
as that in the unblocked algorithm. That is, this algorithm spends about 30% of its work 
in level-2 operations and about 70% of work in level-3 operations. But that does not result 
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in an asymptotic improvement in I/O complexity. If is computed once in the beginning 
of each i-loop, then the total number of blocks moved by the algorithm is 0{N^/B). If 
N < M and k = Q{M/N), then the number of seeks is + If > M and 

k < Tm, then the number of seeks is 0(-^ + If > M and M > > ^M, then 

— ' ^ fcv M 111 ' — ' 

the number of seeks is + ^^j^)- Though most of the computations are performed 

using M-M operations, M-V and V-V operations still dominate the I/O. 

4.2.2 Direct tridiagonal reduction 

Let A G M^^^ be a symmetric matrix. Find Householder transformations Qi, . . . , Qn-2 
such that each Qi is designed to introduce zeroes below the first subdiagonal in the i-th col- 
umn of the matrix Qj_^ ■ ■ ■ QfAQi ■ ■ ■ Qi_i [58l[Il5]. Then T = Q^_2 ■ ■ ■ Q^AQi ■■■Qn-2 
is a tridiagonal matrix. 

For a Householder Q = (I—uu^), where ||m||2= 1/v^, Q^AQ = [I —uu^y A{I —uu^) = 
A — uv'^ — vu'^ , where v = Au — {l/2)uu^ Au. Thus, an algorithm similar to the unblocked 
Hessenberg reduction works for tridiagonal reduction. Due to symmetry, only the lower 
triangular part of the matrix needs to be updated. The number of flops and I/Os would 
be approximately half of that of the Hessenberg reduction. 

A blocked algorithm analogous to Algorithm 14. 21 can be devised for tridiagonal reduction 
too. The observations made there apply here too. The asymptotic complexity remains the 
same; that is, 0{N^/B). 

The asymptotic seek complexity remains same as the Hessenberg reduction. 

4.2.3 Direct bidiagonal reduction 

Let A G MI^^^ , and N > P. Find Householder transformations Qup ■ ■ ■ Qm, Qvi, ■ ■ ■ , Qvp^2 
such that for 1 < i < P—2, Q^- is designed to introduce zeroes below the diagonal in the i-th 
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column of the matrix ■ ■ ■ QiAQ^-^ ■ ■ ■ Qvi_^ , and Q^. is designed to introduce zeroes to 
the right of the first superdiagonal in the i-th row of the matrix Q^. ■ ■ ■ QjAQ^^ ■ ■ ■ Qvi_i- 
For P — 1 < i < P, Qu, introduces zeroes below the diagonal in the z-th column of 

Qui-i ' ' ' Qi -^Qvi ' ' ' Qvp-2' 

Let U = Qui, • • • , Qup and V = Q^^, . . . , Qvp_2- Then B = U'^AV is upper bidiago- 
nal plIHTlfTTH]. 

With the necessary matrix transpositions thrown in, the algorithm can be made to 
run in 0{NP^/B) 1/Os and 0{N P^ / ^/M + N P) seeks. The cases of < P, and lower 
bidiagonal form can be handled similarly. 

If the Householders are aggregated and applied in lots, the computation can be made 
richer in M-M operations. We now discuss one such algorithm j32|H5]. 

If u and V are two Householder vectors, then 

(/ — uu^)A{I — vv^) = A — uw^ — yv^ 

where 

y = Av, h = Al^u and w = h — {u^y)v 

Therefore, if we can compute Uj and Vj independently of each other, after reducing the 
(j — l)-th column and row of the matrix, then the Algorithm 14.31 works: 

The computation of Vj would seem to depend on Uj. In j-th step of the algorithm, the 
j-th row is altered by the left multiplication with and Q^. is to reduce this altered 
j-th row. Similarly, application of Q^^ from the right alters the [j + l)-st column of the 
matrix, which is to be reduced by Quj+i ■ 

But observe that Q^-^ . . . Qvp_2 precisely the sequence of Householders that would 
transform the symmetric matrix A to a tridiagonal form T in the algorithms of Sec- 
tion gXl That is, iiV = Qyi... Qvp_^, then V'^A'^AV = B^B is tridiagonal. 
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Algorithm 4.3. Blocked Direct Bidiagonal Reduction '45'l 



Input: A e R^""^ with N > P, and k, the slab width. 

Output: A is overwritten with a bidiagonal matrix orthogonally equivalent to A. 

Letr= \{P-2)/k]; 
for i = 1 to r do 

s= {i-l)k + 1; 

U = Y = []; /* U and Y have N rows each */ 

V = W = X = []; /*V,W, andX have P rows each */ 

for j = s to min{P — 2, s + A; — 1} do 

Let a, = A[*,3] - ?7l^^[*, j] - FV^^[*, j]; 

Compute Uj to reduce aj; 

Compute Vj by invoking Algorithm \4.4\ with Ag = A[s : N; s : P], V, X and j , 
y. = {A- UW^ - YV^)vj; hj = {A- UW^ - YV^fuj; 

'^j = - iujyj)^j' 

U = {U,u,); V = {V,vj); W = {W,w,); Y = {Y,y,); 
endfor; 

A[l: N; s + 1: s + k-l] = (a,+i, . . . , tts+k-i); 
A[s + k:N; s + k : P] = {A - UW^ - YV^)[s + k : N; s + k : P]; 
endfor; 

Compute Up-i and up; 

update A\P -l:N] P - I : P] = Q^pQup_^A[P -l:N; P -l:P]; 
Hence Vj can be computed as follows: 



Algorithm 4.4. Computation of Vj for Algorithm \4.3~l4M 



Input: A submatrix As = A[s . P; s : N] of A, V, X and j . 

Output: A Householder vector vj such that Q^^, when applied from the right, zeroes the 
elements to the right of the superdiagonal in the j-th row of As. 

z, = AUs[*,j]-XV^[*,j] - VX^[*,j] 
Compute the Householder vector Vj from zj 
t, = {A^As - V,-iX7_, - X,^^V[_,)v,; 

•^3 ~ ~ 2^^j '^jO'^i 

Xj = {Xj^i,Xj) 



If is computed once in the beginning of each i-loop, then the total number of blocks 
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moved by the algorithm is 0{NP'^/B). U N < M and k = Q{M/N), then the number of 
seeks is 0(f^+^). If > M and A; < ^/M, then the number of seeks is 0(f^+^). 
If AT > M and M > A; > VM, then the number of seeks is + ^^). 

4.3 Reduction from Banded Hessenberg Form to Hes- 
senberg Form using Householders 

The direct Hessenberg reductions described in Section 14.2.11 can be used, but after a few 
steps of those, the matrix at hand becomes full, necessitating, on the whole, 0{N^) flops 
and 0{N^/B) I/Os. Algorithms that can exploit the sparsity of the matrix can do better. 
In this section, we propose two such algorithms. 

Let A be the input matrix. Let it be of dimensions N x N and bandwidth t. 
4.3.1 A bulge chasing algorithm 

A reduction from symmetric banded form to tridiagonal form is presented in ^7\. We 
adapt this. 

Annihilate the unwanted subdiagonal elements of the band, column by column, using 
Householders. To reduce the i-th column, for 1 < i < A^ — 2, construct a Householder and 
apply it from the left and right, respectively, to rows and columns numbered i + 1 to i + t. 
This will create a triangular t x t bulge below the band in columns i + 1 to i + 1, i.e., at 
positions A[i + t + 1 : i + 2t; i + 1 : i + t]. Chase the first column of the bulge along the 
band, and out of the matrix. On the completion of the chasing, non-overlapping harmless 
(t — 1) X (t — 1) bulges will lie below the band all along the matrix. The i-th column is done. 
Now reduce the {i + l)-st column. Each t xt bulge created now will subsume the harmless 
{t— 1) X [t — 1) bulges created earlier while processing the i-th column. See Algorithm 14.51 
for a formal description. The first sweep of the algorithm is illustrated in Figure 14.11 
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Algorithm 4.5. Banded Hessenberg form to Hessenberg form using Householders 

Input: A banded Hessenberg matrix A e R^^^ of bandwidth t. 
Output: A Overwritten with a similar Hessenberg matrix. 

for i = 1,. . . ,iV- 2 do 

/*At this point, A[l : i — 1, 1 : i — 1] is a Hessenberg matrix. The band elements 
of the i-th column are handled in this iteration*/ 

Let X = mm{i + t,N}; 

Find a Householder Q such that Q^.A[i + 1 : X,i] has form (*, 0, . . . , 0)"^. 

Letk=\^]; 

for J = 0, . . . , /c — 1 do 

Let Y = min{i + (j + l)t, N}; 

Multiply A[i + jt+l:Y; i + jt + 1 : N] from the left with Q'^ . 
Let Z = min{i + (j + 2)t, N}; 

Multiply A[l : Z;i + jt + 1 : Y] from the right with Q. 
if{j < A; - 1) do 

Find a Householder Q such that Q'^.A[{i + 1) + (j + l)t : Z; i + jt + 1] 
has form (*, 0, . . . , 0)-^. 
endif 
endfor 
endfor 



The householder vectors can be stored in the lower triangular part of the matrix. All 
the Householders generated from the annihilation of the i-th column and subsequent bulge 
chasing is stored below the subdiagonal of the i-th column one after another. 

Zeroing of the i-th column needs 0{N'^/B) blocks of data to be moved. So the total 
number of blocks moved by the algorithm is 0{N^/B). 

The algorithm involves 0{N'^/t) multiplications from the left and right with Household- 
ers. Without loss of generality, assume that the matrix is in column major order. Then, 
for its right updates, the algorithm requires a total of 0{N'^/t) seeks if Nt < M (an entire 
panel of t columns fits in the main memory and can be read in with a seek), 0{NH/M) 
seeks ii t < M < Nt (M elements can be read in with t seeks), and 0{N^) seeks iit > M. 
For its left updates the algorithm requires a total of 0{N^/t) seeks in all cases. 
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Figure 4.1: Reduction of the first column and the corresponding bulge chasing steps 

A small modification to the algorithm can improve its seek complexity. Defer the left 
application of the Q matrices to the horizontal panels. In particular, while reducing the 
i-th row, repeatedly, compute a Householder Q, read the vertical panel on which it is to 
be applied, update the panel using all the pending left updates (the Householder vectors 
for these are available in the i-th column), and finally apply Q on the panel from the right. 
This avoids the need for accessing the horizontal panels. So the overall asymptotic seek 
complexity of the algorithm would be 0{N^/t) if Nt < M, 0{NH/M) ift<M<Nt and 
0{N^) iit> M. 

4.3.2 An asymptotic improvement 

The above reduction does asymptotically no better than the direct Hessenberg reduction. 
Nevertheless, a similar algorithm is popular for tridiagonal and bidiagonal reductions. The 
main drawback of the above algorithm is that to reduce a single column, the whole of the 
portion to its right is read in and written back. 
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We can improve on this if we defer chasing the bulges. We make the observation that 
it is enough to chase the bulges a short distance to make the next few columns ready for 
processing, and use it to achieve an asymptotic improvement. We maintain a box about 
the column that is being reduced, and chase the bulges to the bottom right corner of the 
box. When a number of columns are reduced we would have lined up an equal number of 
bulges at the bottom right corner of the box in a tightly packed sequence. While doing 
this, we need to update with the Householders only the portion inside the box. After lining 
up an appropriate number of bulges, we use the aggregate of the Householders to update 
the portion outside the box at one go. 

On a banded upper Hessenberg matrix A e R^^^ of bandwidth t, the algorithm pro- 
ceeds as follows: 

Divide the matrix into slabs of t columns each. Process the slabs one by one. Begin 
with the leftmost slab. 

Let w he 2t^ — t. Define the "small box" as the intersection of columns Z = ltor = l-|-w 
(left and right) and rows a = 2to6 = l-|-w-|-t (above and below). Thus, it has w + 1 
columns and w + t rows. For /i > 1, let the h-th large box be the intersection of columns 
1(h) = 1 + t + (h - l)k to r{h) = I + w + hk and rows a{h) = i + 2t + {h - l)k to 
b{h) = 1 + w + t + hk. Thus, it has k + w — t + 1 columns and rows, [k is a parameter to 
be chosen later.) 

Process the columns of the first slab one by one. Begin with the first column. Find a 
Householder Q^^^ s.t. qS^^*A[2 : 1+t; 1] is of the form (* • • • 0)^. Multiply A[2 : 1+t; 1 : r] 
and A[a : l + 2t; 2 : 1 + t] with Q^^^ = {Q?V ^om the left and right respectively. Note 
that only the portion inside the small box is involved in these multiplications. A bulge 
forms in columns 2 : 1 + i at rows 2 + t : 1 + 2t. 

Chase the leftmost column of this bulge: Find a Householder Qg^-* s.t. Q^^ * A[2 + t : 
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1 + 2t, 2] is of the form (* • • • 0)^. Multiply A[2 + t : 1 + 2t] 2 : r] from the left, and 

A[a : 1 + 3t; 2 + t : 1 + 2t] from the right with Q^2^ = {Q^2^)^ ■ bulge forms in columns 

2 + i : 1 + 2i at rows 2 + 2t:l + 3t. 

Chase the leftmost column of this new bulge similarly, and continue like that till the 
right boundaries of the latest bulge and the small box coincide. This happens after the 
{2t — l)-th Householder. Nail the bulge in its present position. 

The nailed bulge is unreduced and hence spans t columns, whereas each of the other 
(harmless) bulges has lost one column and so spans t — 1 columns. 

Now process the second column, and chase the bulges similarly, until the latest bulge is 
t — 1 columns away from the nailed bulge of the first column. This involves finding of (2^—3) 
Householders. Note that the bulges created while processing the second column subsume, 
when they are as yet unreduced, the harmless bulges left behind by the processing of the 
first column. 

Process the t columns of the slab like this. When all are done, we will be left with t 
nailed bulges spanning columns t + 1 to 2t^ — t + 1, and any two consecutive nailed bulges 
will be separated by a harmless bulge that spans t — 1 columns. 

Note that the slab and the interspersed sequence of the nailed and harmless bulges are 
tightly packed inside the small box. 

The Householders found till now have been applied only to the relevant portions of 
the small box. Now we form groups of these Householders, aggregate each group into an 
(7 + YTY'^) representation and apply from the left on the portion to the right of the small 
box, and apply from the right on the portion above the small box. (The details of the 
aggregation are given later.) 

Next consider the first large box. The nailed bulges of the small box now occupy the 
top left corner of this box. Move them into the bottom right corner, one by one, starting 
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with the rightmost bulge, and arrange them there in a similar tightly packed manner. 
This process would leave harmless bulges each spanning t — 1 columns in the small box. 
In particular, columns t + 2 to 2t hold the harmless remnant of the bulge introduced by 
the reduction of column t. But column t + 1 is now without a bulge, and so is ready for 
processing. The Householders found during this process are applied on the fly only to the 
relevant portions of the first large box. Now group them, aggregate the groups and apply 
the aggregates on portions outside the first large box. (The details of the aggregation are 
given later.) 

We repeat the process for the remaining large boxes in a left to right order, until the 
nailed bulges are all within the last k + w columns of the matrix. 

Next we chase bulges from the bottom right corner of the last large box, and out of 
the matrix. The matrix is now identical to the one produced by the first t iteration of 
Algorithm 14.51 applied on the original input. 

Now the first slab has been processed. Continue with remaining slabs, until at most k + 
w — t+ 1 rightmost columns of the matrix remain to be reduced. Then invoke Algorithm 14. 51 
on the unreduced square submatrix at the bottom right corner. 

Now the matrix will be in Hessenberg form. 

A more formal description of the algorithm follows: 

Algorithm 4.6. Banded Hessenberg form to Hessenberg form using bulge packing 

Input: A banded upper Hessenberg matrix A G M^^^ of bandwidth t. 
Output: A is overwritten with a similar Hessenberg matrix. 

Let w = 2t^ - t; 

for (z = 1; i < \{N ~ k ~ w + t - l)/t]t; i = i + t) do 

Let the small box be the intersection of columns l{i) = i to r{i) = i + w and 
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rows a{i) = i + 1 to h{i) = i + w + t. 

Thus, it has w + 1 columns and w + t rows. 

for {j = i; j < i + t — 1] j ++) do /* process the j-th column */ 
Initialise Gi . . . G2{t-i)+i to identity matrices of appropriate dimensions. 
Find Householder Q'f^ s.t. Q^f'' * A[i + 1 : j + t; j] has form (* ■ ■ ■ 0)^. 
Aggregate Qi^ into Gi. 

Multiply A[j + l:j + t;j: r{i)] and A[a{{) : b{i); j + + t] with 
qU) j,pQjy^ if^Q i^ji Qjifi Yight respectively. 

A bulge forms in columns j + 1 : j + t and rows j + t + 1 : j + 2t. 
Let Xj = 2 * (t — (j — i + 1)) + 1. Then Xj — 1 is the number of times 
this bulge will be chased before it's fixed in the small box. 
for {z = 2; z < xj; z ++) do 

Find s.t. Q^P * A\3 + - l)t + 1 : j + zt, j + {z - 2)t + 1] 
has form (* • ■ ■ 0)"^. Aggregate Q^P into Gz. 

Multiply A[i + {z -l)t + l : i + zt; j + {z-2)t + l: r(i)] from the left, and 
A[a{i) : j + {z + l)t; j + {z ~ l)t + 1 : j + zt] from the right with Qi^K 
A bulge forms in columns j + {z — l)t + 1 : j + zt and 
rows j + zt + 1 : j + {z + l)t. See Remark below. 
endfor /* Nail the bulge at the present position */ 
endfor 

Multiply A[a{i) : b{i) — t; r{i) + 1 : A^] (right of the small box) from the left, 
and multiply A[l : a{i) — 1; l{i) : r{i)] (above the small box) from 
the right with (Gi . . . G2(t-i)+i)"^ (ind (^2(4-1)+! • • • Gi) respectively. 
See Algorithm |^.7| 
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Let the h-th large box be the intersection of 

columns l{i, h) — i + t + {h — l)k to r{i, h) — i + w + hk and 

rows a{i, h) — i + 2t + {h — l)k to b{i, h) — i -\- w -\- 1 -\- hk. 

Thus, it has k + w — t + \ columns and rows. 

k is a parameter to be chosen later. 

for(/i=l; r{i,h)<N~t; h++) do 

/* move the bulges to the bottom right of the h-th large box */ 
for(j = i; j<i + t-l] do 

Initialise Ghk/t+2t-i ■ ■ ■ G{h-i)k/t+2 to identity matrices of appropriate dimensions. 
xj^2*{t-{j -i + 1)) + 1; 
for {p — 1; p < k/t; p++) do 

Let z = p + {h — l)k/t + Xj. /* for each column j we pick up the 
z- count from where the previous box (large or small) left off */ 

Find Q^P s.t. qP *A[j + {z-l)t + l:j + zt; 3 + {z- 2)t + 1] 
has form (* . . . 0)^. Aggregate with Gz. 

Multiply A[j + {z -l)t + l : j + zt; j + {z - 2)t + 1 : r{i, h)] 
from the left, and A[a{i, h) : j + + j -\- [z — l)t -\- 1 : j -\- zt] 
from the right with . A bulge forms in columns 
j + {z — l)t -\- 1 : j -\- zt and rows j -\- zt -\- 1 : j -\- {z -\- l)t. 
endfor 
endfor 
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The smallest and largest values assigned to z in the above are 
[h — l)k/t + 2 and hk/t + 2t — 1, when (p = 1, j = i + t — 1) and 
(p = k/t,j = i) respectively. See Algorithm \4.S\ 

Multiply A[a{i, h) : b{i, h) — t; r{i, h) + 1 : N] (right of the h-th large box) 
from the left, and multiply A[l : a{i, h) — 1; l{i, h) : r{i, h)] (above the h-th large box) 
from the right with {G^h-i)k/t+2 ■ ■ ■ Ghk/t+2t-iY 
and {Ghk/t+2t-i ■ ■ ■ G(^h-i)k/t+2) respectively. 
See Algorithm \4.S\ 
endfor 

Chase the bulges that are now at the bottom right corner of the {h — l)-th large 
box out of the matrix. (We omit the details.) 
endfor 

Invoke Algorithm^ on A[N + I - [ ^^~^~"'+*^ ]t : A^; N+l - ^ {N-k-w+t) ^^ . 
Aggregate the Householders into groups after all t columns have been reduced 
Update corresponding columns of submatrix 

A[l:N- N+l - ^iJt±p£+tli^t , ]v] from the right. 



Remark: The bulge that forms in columns j + {z — l)t + l : j + zt and rows j + zt + 1 : 
j + {z + l)t subsumes any harmless elements that may have been present there. Columns 
j + (z — l)t + 1 : j + zt cannot have elements below this bulge; column j + zt has been 
cleared of any bulge before proceeding to this step. A snapshot of the matrix after applying 
the bulge introduction phase to the first slab {t = 3) is shown in Figure 14.21 In the figure, 
the submatrix A[i + 1 : i + 2t^; i : i + 2t'^ — t] = A[2 : 19; 1 : 16] is updated from both the 
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Figure 4.2: Introduction of t = 3 tightly packed bulges of size 3x3 after reducing first 3 
columns. 

left and right, the submatrix A[i + 1 : i + 2t^ - t; i + w + 1 : N] = A[2 : IQ; 17 : A^] is 
updated from the left and the submatrix A[l : i; i + 1 : i + w] = A[l : 1; 2 : 16] is updated 
from the right. 

4.3.2.1 Aggregation: the bulge introduction phase 

Let Q^^^ be the Householder that reduces the j-th column, and for 2; > 1, let Q^P be the 
Householder that reduces the first column of the z-th bulge induced by j-th column. Qi^^ 
acts along dimensions j + 1 + {z — l)t to j + zt. 

Consider the processing of the slab beginning at column i. 

The Householders are constructed and applied inside the small box in the following 
order: Q^^^ . . . Qx]Qi'^^'^ ■ ■ ■ ■ ■ ■ Qi^^ But while updating the relevant portions 

outside the small box they need not be applied in the same order. We can aggregate 
them into different groups using a technique proposed in [78], because of the following 
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Figure 4.3: Householders constructed to introduce m = 3 tightly packed bulges after 
reducing first block(first 3) column. 

observations: (See Figure IT2] and Figure 

1. Qi^^s, which reduce the j-th column and the bulges induced by it, act along disjoint 
dimensions. That means, if Qi^^s are to be applied on a submatrix only from one 
side (the left or right), then they can be applied in any order. 

2. The dimensions of Q^J^ and Qy^^^ overlap iS y = z or y = z + 1. Therefore, Q^P can 
be applied only after Qi^ and Q^J^i^ ■ 

3. Qz s are to be applied in the order of their construction. Let Gz be the aggregation 
of Q^z^^s in the order of construction. 

4. If the Gz^s are applied in the decreasing order of z, then the above two requirements 
will be met. 

The computing of (Gi . . . G2[t~i)+iVA[a{i) : b{i)-t; r(z) + l : A^] and A[l : a(i)-l; : 
r(z)](G2(t-i)+i • • • Gi) can be done as follows. Note that Xi = 2{t — 1) + 1, Gxi = Q'x} and 
Gxi-i = Q^xi-i- ^ slab only the first column is subjected to more than Xj — 2 chases. 

102 



CHAPTER 4 



Hessenberg, Tridiagonal, and Bidiagonal Reductions 



G^,_2 is an aggregation of Q^La and Qf^_2- Gxi-3 is an aggregation of Q^lg and Q^-g. 
In a slab only the first two columns are subjected to more than Xi — A chases. Continuing 
like this, we find that for 1 < z < Xi, if q — Xi — z + 1, then 1 + [{q — 1)/2J is the number 
of Householders aggregated into G^, which therefore operates on y — t + [{q — 1)/2J rows 
and columns. 
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The computation can be done as follows: 



Algorithm 4.7. Multiplying with the Left and Right Aggregates from Bulge Introduction 

Input: Aggregations of the {2t — 1) groups of the left and right Householders. 
Output: Updated matrices. 

for {z — xf, z > 1; z ) do 

q = Xi - z + 1; 

y = t+[iq-l)/2\; 

Multiply A[i + {z-l)t + l:i + {z- l)t + y; r{i) + 1 : N] 

from the left with 
Multiply A[l : a(i) - 1; i + (z - l)t + 1 : i + (z - l)t + y] 

from the right with G^ 
endfor 
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4.3.2.2 Aggregation: the bulge chasing phase 

In the bulge chasing phase of the algorithm, repeatedly, t bulges that are nailed at the 
top left corner of a large box in a tightly packed sequence are chased one by one over a 
distance of k. This would leave them at the bottom right corner of the large box, which is 
also the top left corner of the next large box. Each bulge is, thus, chased k/t times from 
its current position. See Figure HiH 

During the chasing steps, the Householders found are applied on the fly only to the 
relevant portions of the box. Afterwards, we group the Householders, aggregate the groups 
and apply the aggregates on portions outside the box. 

Consider the processing of the slab beginning at column i. Suppose we are at the h-th 
large box of this slab. Let a = Xi + {h — l)k/t + 1. 

The Householders are constructed and applied inside the box in the following order: 



Qa^ • • • Qiik/t-iQt-? ■ ■ ■ Qttk/t-s ■ ■ ■ Qa~it+2 ■ ■ ■ Qt'^lt+k/t+v But whllc Updating the rel- 



evant portions outside the box they need not be applied in the same order. The observations 
made during the bulge introduction phase hold for these Householders too. Hence these 
too can be grouped and aggregated the same way. That is, all Q's with the same subscript 
can be grouped together, and aggregated in the order in which they are constructed. Note 
that the smallest and largest subscripts are a — 2t + 2 and a + k/t — 1 respectively. See 
Figure 14. 5[ 

For a fixed h, as j varies over + t — 1], and as p varies over [1, k/t], 



where D{p, j) = p — 2j + 2i — l. Note that each z corresponds to a group. When z is fixed, 
D{p,j) is also fixed, as a is independent of p and j. 




+ 2t]+p-2j + 2i~l = a + D{p,j) 
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Figure 4.4: Chasing of 3 tightly packed bulges. 
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Figure 4.5: Householders constructed to chase 3 tightly packed bulges. 
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Consult the above table. For -{2t - 2) < D < k/t - 1, set {{j,p)\D{p, j) = D} 
corresponds to Ga+D- If -D > 0, then the leftmost cell with value D is {i,D + 1). 

Otherwise, the leftmost cell {j,p) is (i + \{—D)/2'\,b), where 6 = 1 if D is even, 2 if is 
odd. If is of value D, then the next cell of value D to the right is (j ' + l,p + 2). 

Qi^^ operates on range [j + {z - l)t + 1 : j + zt] = at + {j + Dt) + [-t + 1 : 0]. 

For a given z, and D = z — a, lei E{D) he the number of Householders aggregated into 
Gz- The range of Gz then spans E{D) — l+t dimensions starting at at+Dt+{—t+l)+j*{D), 
where j*{D) is the smallest row j that contains D. 
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We assume that 2t - 2 < k/t. U D < 0, then E{D) = [{D + 2t)/2\. U D >0 and 
2t-l + D <k/t, then E{D) = t. U D > and 2t- 1 + D > k/t, then E{D) = f 1(| - D)] . 

Therefore the updations can be done using the following algorithm: 



Algorithm 4.8. Multiplying with the Aggregates from Bulge Chasing 

Input: The k/t + {2t — 2) aggregations of the left and right Householders. 
Output: Updated matrices. 

for {z = hk/t + 2t-l; z > {h - l)k/t + 2; z ) do 

D^z- ^^i^ - 2t; 

IfD>0, then j* = i; else j* = (i + \{-D)/2]). 
IfD<0, then E = [{D + 2t)/2\; 
else if2t-l + D < k/t, then E = t; 
elseE= fi(f -D)]. 

L = zt + j* -t + 1; U^L + t + E-2; 
Multiply A[L : U ; r{i) + 1 : N] from the left using Gz', 
Multiply A[l : a{i, h) — 1; L : U] from the right using G^; 
endfor 



4.3.2.3 I/O complexity 

We first calculate the number of blocks of data moved by the algorithm. 

Without loss of generality, we assume that t < \fM. In the previous chapter, we found 
that the I/O complexity of banded Hessenberg reduction is 0{N^ / B\fM) \l t > \fM. 
Therefore, there is no advantage to choosing t > y/M there. 

Each bulge introduction phase processes a small box of size (w+t) x (w + l) — ©(i^). To 
process a column of the slab, on an average, half of the small box needs to be updated. This 
amounts to a total of Q{t^/B) blocks of data movement. At the end of a bulge introduction 
phase for the slab beginning at the i-th column of the matrix, awx {N — i — w) submatrix 
is to be updated from the left and a i x (ty + 1) submatrix is to be updated from the 
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right. These involve the application of the group aggregates one by one. Each group 
aggregate applies on at most 2t — 1 dimensions. For no z do the ranges Gz and Gz+2 
overlap. Therefore, these updations require each updated element to be handled at most a 
constant number of times, and so would incur 0{{w{N — i) + wi)/B) — 0{Nt^ /B) blocks 
of data movement. There are Q{N/t) bulge introduction phases. Over all of them, the 



Now consider bulge chasing phases. 

Each bulge chasing phase processes a large box of size (k + w — t + l) x{k + w — t + l) — 
Q{{k + wy/B). To process a column of the slab, more than half of the large box needs to 
be updated. This amounts to a total of Q{{k + w)H/B) blocks of data movement. At the 
end of the h-th bulge chasing phase for the slab beginning at the i-th column of the matrix, 
a (1^ — 2^ + ^+1) X {N — i — w — hk) submatrix is to be updated from the left and an 
{i-\-2t + hk — k — l) X {w — t-\-k) submatrix is to be updated from the right. These involve 
the application of the group aggregates one by one. Each group aggregate applies on at 
most 2t — l dimensions. For each z, the ranges of and Gz+2 begin at least 2t — \ apart. 
Gz aggregates at most t Householders {E{D) <t). Gz spans at most E{D) + t — l < 2t — l 
dimensions. That is, the ranges of Gz and Gz+2 cannot overlap. Therefore, these updations 
require each updated element to be handled at most a constant number of times. That 
means the processing of each large box would incur a total of O ^ (^+"') t+N{w+k) ^ blocks 
of data movement. This is continued for a Q{N/k) times to reduce a slab of t columns. 
There are N/t slabs. When k — Q{w) — 6(t^), the data movement is minimised, and is 
0{^-^ + j^)- The cost of chasing the bulges out of the matrix from the last large box 
cannot be larger than the above. So the overall data movement cost of the algorithm is 



Now we calculate the number of seeks needed. Assume without loss of generality that 




blocks. 
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the matrix is kept in column major order. With k — Q{t^) the large and small boxes are 
all of G(t^) size. Therefore, we analyse only the large boxes. There are 0{N'^/t^) large 
boxes overall. 

There are 6(t^) inside-the-box right updates oit'^ xt vertical slabs. Each takes t seeks 
if < M and f^/M otherwise; in the latter case, we read an {M/t) x t submatrix in t 
seeks; there /M such submatrices in a vertical slab. The total number of seeks incurred 
by these right updates is O^N"^) lit^ <M and 0{NH/M) otherwise. 

There are Q{t^) inside-the-box left updates ol t x horizontal slabs. These can be 
clubbed with the inside-the-box right updates. Since < M, we can keep all the House- 
holder vectors generated to chase a particular column in-core. When a vertical panel is 
accessed for right updates, first apply all the pending left updates, then the right update. 

For each large box, the algorithm needs to perform a left update of the right side of the 
box. This is performed as a series of left updates with Householder aggregates of horizontal 
panels of size t x N. Each update of the series takes N seeks, for a total of 0{Nt) seeks 
per large box. However, if < M, then all the aggregated Householder groups of a box 
can be kept in-core. All the left updates of a large box can therefore be performed 0{N) 
seeks: read the columns of the slab one by one, update and write back. Thus, the total 
seeks over all large boxes is 0{N^/t^) if < M, 0{N^/t^) otherwise. 

Analogously, a right update of the portion above the box is also performed. The apph- 
cation of a Householder aggregate to an iV x t vertical slab requires one seek if Nt < M {t 
consecutive columns of the matrix can be accessed in one seek), and Nt'^/M otherwise {t 
seeks to access an {M/t) x t segment of the panel; Nt/M such segments). Thus, the total 
seeks over all large boxes is 0{N'^/t^) if Nt < M, and 0{N^/M), otherwise. 

Combining all of the above, we find that the seek complexity of the algorithm is: 

0{N^ + N^/t^) if t^ < M, and 0{N^/t^) otherwise. 
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In the above analysis we assumed that the boxes do not fit in the main memory. But if 
t = 0(M^/^), then a 0(t^ x t^) box can be kept in-core. If we repeat the analysis with this 
assumption, we find that the total I/O complexity of the Algorithm is 0{^^ + I?)" 

4.3.2.4 The two step Hessenberg reduction 

In Section 14.2.11 we saw that both the unblocked and blocked reductions of a full non- 
symmetric matrix to Hessenberg form cost 0{N^/B) I/Os. The I/O performance can be 
improved if the reduction is split into two steps: 

1. Reduce the matrix to banded Hessenberg form 

2. Reduce the banded Hessenberg matrix to Hessenberg form 

The first step, we saw in Chapter 3, takes 6 (j^^ I/Os if t < ^/M and 6 ^ J^^ ^ if 
t > v^M, where t is the band width. We analysed the second step above. The second step 
dominates the I/O complexity. 

We conclude the following: 

• li N > M^M, choose t = Q{VM). The total I/O cost will be O (^;^). That 
is, for large values of A^, our algorithm matches the lower bound on number of data 
movement for the problem [13] . 

• If < < Ma/M, then choose t = e(A^^/^), the total I/O complexity will be 
O (]vT7Tb)- (We assume that M^/^ < B.) 

• If Vm < N < M^/^, then choose t = e(M^/^). The total I/O complexity will be 
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Thus, our two stage reduction process improves the asymptotic I/O complexity of Hes- 
senberg reduction by the following factors, over the direct single step unblocked and blocked 
reductions. 

• a factor of Vm, if > M^M 

• a factor of A^^ if M^/^ < < mVM 

• a factor of y/M/B, if Vm <N < M^/^ 

4.4 The Reduction of a Symmetric Banded Matrix to 
Tridiagonal Form 

Direct reduction of a full symmetric matrix to tridiagonal form was discussed in Subsec- 
tion |4221 If the matrix is a symmetric banded one then these reduction algorithms are not 
the methods of choice, as they don't take advantage of the sparsity of the banded matrix. 
Furthermore, after a few reduction steps of these algorithms, the matrix becomes full with 
the bulges created, resulting 0{N^) flops and 0{N^/B) I/Os pHITTtlTS]. 

The banded structure of the matrix makes it possible to employ the bulge chasing 
technique |211|55l[72l[77ll95l[97] . We briefly discuss a few of these algorithms. 

The algorithm of [HZ] chases a bulge as soon as it forms using Jacobi rotations. On an 
N X N matrix of semi-bandwidth t the algorithm proceeds as follows: Annihilate element 
ait+i using a Jacobi rotation Utt+i along dimensions t and t + 1. Because of symmetry, it 
is enough to consider the elements above the diagonal. The rotation introduces a bulge at 
'2t2t+i- Annihilate this bulge using a rotation U2t2t+i- A bulge forms at a2t3t+i- Repeat 
this process until the bulge is chased out of the matrix. Now start the next iteration with 
the annihilation of au. Once all the unwanted elements of the first row are reduced, move 
on to the next row. The algorithm returns a tridiagonal matrix when all the N — 2 rows 
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are reduced. To reduce any row, the rest of the matrix has to be read and written twice. 
So total I/O complexity of the algorithm is 0{N'^t/ B). 

To access each row costs 0{t) seeks, so the seek complexity of the algorithm is 0{N'^t). 

This algorithm was improved in [72] by incorporating vector operations on vector ma- 
chines. The bulges are chased in a delayed fashion. Several bulges are introduced before 
any is chased out of the matrix. The bulge created by an element is chased and placed 
at a safe distance before the next element of the band is annihilated. Once a number of 
elements have been annihilated and bulges formed along the diagonals at a distance of t 
from each other, vector operations are invoked to chase several bulges simultaneously. 

Though this algorithm uses vector operations to exploit the locality, its asymptotic I/O 
complexity is the same as that of the previous algorithm, namely 0{N'^t/ B). Similarly the 
asymptotic seek complexity remains same as that of the previous algorithm. 

Algorithms have been designed using Householders instead of rotations too [771195]. 
These impart more data locality to the problem. We discuss the algorithm of [95] . 

This algorithm processes the columns of the input matrix A (symmetric, banded with 
a semi-bandwidth of t) one by one. When the j-th column is taken up for processing, the 
first (j — 1) columns (and rows) would have already been reduced. So the leading j x j 
submatrix of A would be in tridiagonal form. The unwanted elements of the j-th column 
(the band elements aj+2ji ■ ■ ■ , o-j+tj) are annihilated at one go by applying a Householder 
on rows and columns numbered from j ' + 1 to j ' + t, from the left and right respectively. 
This will introduce a triangular bulge in columns j ' + 1 to j ' + 1 at rows j + 1 + 1 to j ' + 2t. 
(We talk only of the lower triangular part.) If this submatrix is QR-decomposed and the 
resultant Q matrix is applied to A from the left and right, the whole of the bulge would 
vanish, but a new bulge would appear t positions below to the right. This step is continued 
till the bulge is chased out of the matrix. The algorithm moves on to the next column. 
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The reduction of the j-th column involves O {^^j^) bulge chases, each of which causes a 
QR update that costs Q{t^ / B) I/Os. So the I/O complexity of the algorithm is 0{NH/B). 

The problem with the above algorithm is that it chases the whole of a t x t bulge away 
as soon as it forms. But it is sufficient to chase the first column of the bulge away, for the 
next column to be ready for reduction. The remaining part of the bulge will be subsumed 
by bulge created by the next column, and therefore is harmless for the correctness of the 
algorithm. This observation led to the parallel algorithm of |77] . 

This algorithm is similar in structure to the above one. Hence the details are omitted 
here. Each chase here involves a single Householder, rather than an aggregate of t House- 
holders. This reduces computation. However, the asymptotic I/O complexity remains 
0{NH/B). 

To reduce each column, the leading submatrix is partitioned into t x t blocks. So the 
seek complexity of the algorithm is 0{{tN/y/M + N)N) = 0{tN'^/^/M + N'^). 

4.4.1 The two step tridiagonal reduction 

In Subsection 14.2.21 we saw that both the unblocked and blocked reductions of a full 
symmetric matrix to tridiagonal form takes 0{N^ /B) I/Os. The I/O performance can be 
improved if the reduction is split into two steps: 

1. Reduce the matrix to symmetric banded form 

2. Reduce the symmetric banded matrix to tridiagonal form 

The first step, we saw in Chapter 3, takes 9 (^-^j I/Os if t < \/M and O ^ if 
t > \/M, where t is the semi-bandwidth of the symmetric banded form. We analysed the 
second step above. 
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We calculate the combined I/O complexity of the two steps, and pick the t that optimises 
this quantity. We conclude the following: 

• If > M, choose t = 0(yM). The total I/O cost will be O I/Os. That is, 
for large values of A^, our algorithm matches the lower bound on the data movement 
for the problem [13] . 

• li < N < M, choose t = y/N. The total I/O cost will be O (^) • 

• If < choose t = B. The total I/O cost will be will be 0{N'^). 

Thus, our two stage reduction process improves the asymptotic I/O complexity of tridi- 
agonal reduction by the following factors, over the direct single step unblocked and blocked 
reductions. 

• a factor of ^M, if > M 

• a factor of A^l, if < A^ < M 

Note that the seek complexity of the first step is 0{N^ /kt^) if only one tile fits in the 
main memory and 0{N^/t^) otherwise. (See Chapter 3.) This is inconsequential. 

4.5 The Reduction of a Banded Matrix to Bidiagonal 
Form 

Let A G M^^^ be a banded matrix with a lower band width of t and an upper band width 
of m, i.e., A{i,j) = ioi i > j + t or i < j — m. Without loss of generality, we assume 
that N > P, and that A is to be reduced to upper bidiagonal form; the case of A^ < P 
is analogous, and only a minor modification is necessary to reduce A to lower bidiagonal 
form. 
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Figure 4.6: The first sweep in the reduction of a banded matrix A G M^^^^^ with lower 
bandwidth t = 3 and upper bandwidth m = 2 to upper bidiagonal form using rotators. 

The direct bidiagonal reductions discussed in Section l4.2.2l are not the methods of choice 
when the input is a banded matrix. Those algorithms, when applied on a banded matrix, 
fill the matrix with unwanted entries in a few steps, and therefore fail to take advantage 
of its sparsity [2ll[7Tl[78]. The LAPACK library [7j contains a rotation based bidiagonal 
reduction algorithm that chases a bulge along the band and out of the matrix as soon as 
it is formed. One sweep of this algorithm is demonstrated in Figure 14. 6[ In the Figure, 'x' 
denotes an unchanged element, 'u' denotes an element that is modified during the current 
sweep, 'b' denotes an element of a bulge and '0' denotes an element that is zeroed during 
the current sweep. This algorithm starts by annihilating at+n using a rotator utt+i along 
rows if: and t + 1. Applying of this rotator generates a bulge at position (tt + m + l) above 
the band. This bulge is chased immediately using a right rotation Vt+mt+m+i on columns 
t + m and t + m + 1. A new bulge forms at position {2t + m + It + m) below the band. 
Chase it too, and continue like this until a bulge is chased out of the matrix. The cost 
of this is 0(min{A^, P}) flops and O (min{A^, P}/i?) I/Os. The next sweeps annihilate 
ati, . . . , (321 in that order. After this Oim+i, • • • , ^is are eliminated in that order. Now the 
first row and the first column are done. Continue for 0(min{A^, P}) times to reduce the 
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Algorithm 4.9. Reduction of a banded matrix to bidiagonal form using Householders 1 791 
Input: A handed matrix A G M^^-'^, N > P of bandwidth t 
Output: A is overwritten by the upper bidiagonal matrix. 

for j = 1, . . . , P do 

LetX = A[j:N; j : P]; x = P - j + 1; 

Partition X as shown above; 

Find a Householder U^^ for reducing An; 

An := f/F^.An; 

for /c = 2, . . . , kmax do 

Ak-ik '■= u\^li-Ak-ik; 

Find a Householder V^f^ for reducing Ak-ik 's first row 

Ak^ik '■= Ak-ik-Vf^ ; Akk '■= Akk-Vk ' 

Find a Householder uj,''^ for reducing Akk 's first column 

Akk •= U^\Akk; 
endfor 
endfor 

matrix to upper bidiagonal form. The reduction takes 0(min{A^, P}^(t + m)) flops and 
0{imxi{N,PY{t + m)/B) 1/Os. 

The seek complexity of the algorithm is 0(min{A^, -P}^(t + m)). 

In [6Tl[78l[79j , parallel and sequential algorithms are proposed for the bidiagonalisation 
of a banded matrix using Householder transformations. We discuss an adaptation of an 
algorithm from [79j. This algorithm is similar to the Householders based tridiagonalisation, 
we saw earlier and is described in Algorithm 14. 9[ 

Given a banded matrix A G M^^-'^ of bandwidth t with N > P, this algorithm repeatedly 
performs the following partition of a matrix X G where x = P — j + 1: 

• the flrst block column consists of the flrst column of X 

• for 2 < k < kmax = \{x — l)/{t + m)] + 1, the k-th block column consists of columns 
{k-2){t + m) + 2 to {k-l){t + m) + l of X. 

• the fcmax-th block column consists of columns {kmax — 2)(t + m) + 2 to x of X. 
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Figure 4.7: The first sweep in the reduction of a banded matrix A G 1^22x20 lower 
bandwidth t = 3 and upper bandwidth m = 2 to upper bidiagonal form using Householder 
transformations. 



• the first block row consists of rows 1 to t + 1 of X 

• for 2 < A; < k^ax = \{x — + m)] + 1, the k-th block row consists of rows {k — 2){t + 
m) + t + 2 to (A; - + m) + t + 1 of X. 

• the kmax-ih block row consists of rows {kmax ~ 2)(t + m) + t + 2 to x of X. 

Let Xj j denote the tile that is the intersection of the i-th block row and j-th block column 
of X. 



The first sweep of the algorithm is demonstrated in Figure UTTl In the Figure 'x' denotes 
an element that was nonzero before the sweep, 'b' denotes an element of a bulge, and the 
small boxes inside tile denote the elements that are to be zeroed during the sweep. 



117 



CHAPTER 4 



Hessenberg, Tridiagonal, and Bidiagonal Reductions 



This algorithm takes 0(P^(t + m)) flops and 0{P'^{t + m)/B) I/Os. In general, bidiag- 
onal reduction using this algorithm incurs 0(min{A^, P}^{t + m)/B) 1/Os. 

To reduce each column, the leading banded matrix is partitioned into blocks which takes 
0{mm{N, P}{t + m)/\/M + min{A^, P}) seeks. So the seek complexity of the reduction 
algorithm is 0(min{A^, P}'^{t + m)/y/M + min{A^, P}^). 

4.5.1 The two step bidiagonal reduction 

Without loss of generality, we assume that the banded matrix is upper banded. Let t 
be its bandwidth. The Algorithm 14.91 which takes a banded matrix of nonzero lower and 
upper bandwidths can easily be modified to handle upper banded matrices; the asymptotic 
complexity remains the same. 

In Subsection I4.2.3[ we saw that both the unblocked and blocked reductions of a full 
matrix to bidiagonal form takes 0{N^/B) I/Os. The I/O performance can be improved if 
the reduction is split into two steps [6T | [78 t [79]: 

1. Reduce the matrix to banded form 

2. Reduce the banded matrix to bidiagonal form 

The first step, we saw in Chapter 3, takes O ^^[ILEEdEiI!!^ I/Os if t < ^/M and 
O ^ ^-P^^W-P} j I/Os if t > -\/M, where t is the bandwidth of the banded matrix. We 
analysed the second step above. 

We calculate the combined I/O complexity of the two steps, and pick the t that optimises 
this quantity. We conclude the following: 



If min{iV, P} > VM and max{A^, P} > M, choose t = Q{VM). The total I/O cost 
will be O j • That is, for large matrices, our algorithm matches the lower 

bound on data movement for the problem [13] . 
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• If Vm < P,N < M, choose t = e{^meiK{N, P}). The total I/O cost will be 

Q f uan{N,P}NP \ 
\ ^max{Ar,P}B J ' 

Thus, our two stage reduction process improves the asymptotic I/O complexity of bidi- 
agonal reduction by the following factors, over the direct single step unblocked and blocked 
reductions. 

• a factor of VM, if min{A^, P} > \fM and max{iV, P} > M 

• a factor of -y/maxliV, P}, otherwise if ^/M < P,N < M 

The seek complexity of the two stage reduction process is 0(min{A^, P}^+iVP min{A^, P} / (M\/M)) 
with t y/M, if min{Ar, P} > and max{Ar, P} > M, and 0{imn{N, P}Vmax{Ar, P}/VM) 
with i ymax{Ar, p}, if < P,N <M. 
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The Generalised Eigenvalue Problem 

5.1 Introduction 

Many scientific and engineering applications [98 | [99 |[T0Hll05] instantiate the Generafised 
Eigenvalue Problem (GEVP), where we seek to find, for two matrices A,Be C^^^, every 
A that for some nonzero v satisfies 

Av = XBv (5.1) 

The GEVP becomes the standard eigenvalue problem (EVP) when B = I. A nonzero 
V that satisfies Equation 15.11 for some value of A is called an eigenvector of the ordered 
pair {A, B), and A is called the associated eigenvalue; A and v together form an eigenpair. 
Av = XBv implies that {XB — A)v = 0, which in turn implies that A is an eigenvalue of 
{A, B) if and only if the characteristic equation 

det{XB -A) = Q 

is satisfied. det(Ai? — A) is a polynomial of degree N or less and is called the characteristic 
polynomial of the pair {A, B). The form XB — A is called a matrix pencil. 

The pair (A, B) has finite nonzero eigenvalues, counting multiplicity, iff rank{B) = N. 
If B is rank deficient, then the N eigenvalues of {A,B) include O's and oo's |115] . The 
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matrix pair {A, B) is said to be singular if A and B are singular. The pair is called regular 
otherwise. 

Two pairs of matrices (A, B) and (A, B) are said to be equivalent if there exist nonsin- 
gular matrices U and V such that A = UAV and B = UBV, which implies A — XB = 
U{A — XB)V. Equivalent pairs have the same characteristic equation, and hence the same 
eigenvalues. An equivalence transformation converts a pair into an equivalent one. 

If matrices A and B are symmetric, B is positive definite, and for an upper triangular 
R the Cholesky decomposition of B is RR^ , then (A, R'^v) is an eigenpair of R~^AR~^ 
iff (A,f) is an eigenpair of {A,B). If A and B are symmetric, A is positive definite, and 
for an upper triangular R the Cholesky decomposition of A is RR^, then {1/X, R'^v) is 
an eigenpair of R~^BR~^ iff (A, f ) is an eigenpair of {A,B). That is, in either case, an 
equivalence transformation with U = R~^ and V = R~^ reduces the GEVP instance into 
an EVP instance with a symmetric coefficient matrix, which can then be solved using either 
the symmetric QR-algorithm, Jacobi method or the Bisection method for the symmetric 
EVP [58 | [87 t[TT5] . Efficient out-of-core Cholesky factorisation is discussed in [T6] . 

If S is a nonsingular matrix, then (A, v) is an eigenpair of [A, B) iff (A, v) is an eigenpair 
of B^^A iff {X,Bv) is an eigenpair of AB^^. If A is a nonsingular matrix, then (A, f ) is 
an eigenpair of {A,B) iff (1/A,f) is an eigenpair of A^^B iff {l/X,Av) is an eigenpair of 
BA~^. Therefore, most instances of the GEVP can be reduced to the EVP. But, for a 
number of reasons, this may not be the best way to solve the GEVP | 115 ]. Suppose B 
is nonsingular, and consider the reduction to the EVP on AB^^. If B is ill conditioned, 
then the eigenvalues of the computed AB~^, even if they are well conditioned, can be poor 
approximations to the eigenvalues of the pair {A,B). Even if A and B are symmetric, 
AB~^ may not be symmetric and it might be desirable to preserve symmetry. Even if 
A and B are sparse, AB~^ may not be, and a computation with it could be therefore 
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expensive. Thus it is useful to develop direct algorithms to solve the GEVP. 

Schur theorem for the EVP guarantees that for every matrix A G C^^^ there exists 
a unitary matrix U G C^^^ such that T = U*AU, where T is upper triangular. That 
means A is unitarily similar to T. The analogous result for the GEVP is called "the 
Generalized Schur Theorem" and is stated as follows |115j : "Let A,B & C^^^ . Then 
there exists unitary Q,Z E C^^^ and upper triangular T, S E C^^^ such that Q*AZ = T 
and Q*BZ = S. Thus Q*{A — XB)Z = T — AS"'. That means {A, B) is unitarily equivalent 
to a triangular-triangular pair (T, 5*). If (T, S) could be found, then the eigenvalues could 
be read off as T[fc, k]/S[k, k]Joi I < k < N . 

A matrix T G M"^" is called quasi-triangular if it has the block upper triangular form 



where each main diagonal block is either 1 x 1 or 2 x 2, and each 2x2 block has com- 
plex eigenvalues. If A and B are real matrices then the real counterpart of the Wintner- 
Murnaghan Theorem applies. It is stated as follows: "If A and B are in R^^^, then there 
exist orthogonal matrices Q and Z such that Q^AZ is upper quasi-triangular and Q^BZ 
is upper triangular." That means {A, B) is orthogonally equivalent to a triangular-quasi- 
triangular pair. 

Abel's Impossibility Theorem shows that there is no general formula for the roots of a 
polynomial equation of degree iV, if > 4. Consequently, there is no general formula for 
the eigenvalues of an A^ x A^ matrix, if A^ > 4 jll5] . No algorithm can find the equivalence 
transformations guaranteed by Schur Theorem in a finite number of steps. Therefore, as 
in the case of EVP, we need iterative algorithms to solve GEVP. 

Such algorithms for the GEVP are known, and typically proceed in two stages. In the 
first stage, the input pair {A, B), both matrices of which could be full in general, is reduced 
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first to an equivalent full-triangular (F-T) pair, and then to an equivalent Hessenberg- 
Triangular (H-T) pair. In the second stage, an iterative method called QZ algorithm [58t 
[86llll5[[117] is applied to the reduced matrix pair. The first stage takes 0{N^) flops in 
these algorithms. But these algorithms are rich in vector-vector (V-V) and vector-matrix 
(V-M) operations, and not in matrix-matrix (M-M) operations. 

To improve the performance, it has been proposed that the reduction in the first stage 
be split into two steps [HISTllTO], the first step reducing a full-full (F-F) pair to an F- 
T pair and then to a banded Hessenberg- Triangular pair (BH-T) [Tl [3514371 170] . and the 
second step further reducing it to an H-T pair [UISTIITO]. All reductions use equivalence 
transformations. Though the modified first stage takes more flops, its richness in M-M 
operations, makes it more efficient on machines with multiple levels of memory. 

For reducing an F-T pair to a BH-T pair, tile based algorithms are known [Tl 1351136] . 

In this chapter we focus on the first stage. We analyse some known algorithms for their 
I/O complexity. We present a new algorithm too. 

In all the algorithms we discuss, it is assumed that we begin with an F-T pair, as 
the transformation from an F-F pair to an F-T pair can be achieved through a QR- 
decomposition. 

5.1.1 Organisation of this Chapter 

In Section 2, we study some of the unblocked direct H-T reductions, and their I/O per- 
formances. Reduction to H-T pair using the slab approach is discussed in Section 3. We 
discuss the tile based reduction algorithms for F-T pair to BH-T pair in Section 4. Reduc- 
tion of BH-T pair to H-T pair is discussed in Section 5. 
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5.2 Unblocked Reduction from Full- Triangular Form 
to Hessenberg- Triangular Form 

See Algorithm 15.11 This algorithm was first presented in [86]. It uses Givens rotations. 
As can be readily seen, it requires 0{N^) flops. Use of fast Givens rotations here can 
approximately halve the computational costs j 84j . 

Algorithm 5.1. Reduction from F-T form to H-T form 186] 
Input: Matrices A,B E M^^^, where B is upper triangular 

Output: Orthogonal matrices Q, Z E R^^^ such that {H, T) = {Q'^AZ, Q^BZ) is an 
H-T pair. H and T overwrite A and B, respectively. 

Remark: Gi G R^^^ denotes a Givens rotation acting on rows/ columns i — 1 and i. 1^ 
stands for the identity matrix of order N. 

Set Q In; Z 1^; 
for J = 1 to - 2 do 

for i = A^ to J + 2 step —1 do 

Construct Gi such that the {i,j)-th entry of GfA is zero; 
Update A 4- GfA, B ^ GjB, Q ^ QG^; 
Construct Gi such that the {i,i — l)-th entry of BGi is zero; 
Update A ^ Ad, B <- Bd, Z ^ Zd; 
end for 
end for 



The total amount of data the algorithm needs to move, measured in blocks, assuming 
that Q and Z need not be explicitly constructed, is O {N^/B). Explicit construction of Q 
and Z matrices also requires to move O (N^/B) data blocks. 

Assume that the matrices are stored in column major order and that N > B. Then 
accessing a row will require a seek per element. Therefore, the seek complexity of the 
algorithm is O ( "^fSi Yl!i=j+2 ^ ^ j) — 0{N^), for an overall I/O complexity of 0{N^). 
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5.2.1 An 0(1) improvement 

A drawback of the above algorithm is that the Givens rotations are applied as soon as 
they are constructed. The performance could be improved if a number of rotations are 
accumulated and applied together. The algorithm of [70] does that; the rotators that 
reduce A[j + 2 : A^; j] (respectively, annihilate the subdiagonal bulges in B created by 
them) are not applied on A and B as soon as they are formed, but are accumulated then 
applied together from the left (resp., right). See Algorithm 15.21 



Algorithm 5.2. Reduction from F-T form to H-T form ^70j 
Input: Matrices A,B & M.^^^ , where B is upper triangular 

Output: Orthogonal matrices Q, Z G R^^^ such that {H, T) = {Q^AZ, Q^BZ) is an 
H-T pair. H andT overwrite A and B, respectively. 

Remark: row-givens and column-givens apply a sequence of Givens rotations to the rows 
and columns of a matrix respectively. 

Set Q ^ In, Z 1^ 
for J ^ l,2,...,A^-2 do 

Construct a sequence of Civen rotations G^, 
row- givens (Gn, • • • , Gj+2, A); 
row- givens (Gn, • • • , Gj-^-2, B ); 
column- givens (Gn, • ■ • , Gj+2, Q ); 
Construct a sequence of Civen rotations Gn, 

bulges in B; 
column-givens (G N , • • • , Gj+2, B ); 
column-givens (Gn, ■ ■ ■ , Gj+2, A); 
column-givens (Gn, ■ ■ ■ , Gj^2, Z); 
end for 



The total amount of data moved remains 0{N^ /B), as the computation involved is the 
same. But the seek complexity is now improved. This can be seen as follows: 

Consider the case of A^ < M/ 4. Without loss of generality, assume that the matrices are 
in column major order. Maintain the sines and cosines of the accumulated rotators in the 
main memory. Load a slab of Q[M/N) columns of the matrix in the main memory, update 



. . . , Gj^2 to reduce A{j + 2 : N; j); 



. . . , Gj+2 to annihilate the subdiagonal 



126 



CHAPTER 5 



The Generalised Eigenvalue Problem 



the relevant portions of this slab and write back. All this takes 0(1) seeks. Q{N^/M) 
(resp., Q{LN/M)) slabs have to be handled for a left (resp., right) multiplication, where 
L is the number of rotators accumulated. Note that the average value of L is G(A^) over 
the iterations. Therefore, the number of seeks is at most * (N'^/M) = 0{N^ /M). The 
I/O complexity is, thus, 0{N^/B). 

The above will not work, if > M/4. In that case, we can still transpose the matrix 
before and after every left application so that it is done on a row major form. The trans- 
positions, over the iterations, cause a net data movement of 0{N^/B) blocks and seeks 
that number 0{N^ /\/M). The rest of the algorithm will now incur 0{N^/M) seeks. The 
I/O complexity of the algorithm is, thus, 0{N^ / B + N^/y/M) = 0{N^/B), if B^ < M. 

5.3 Slab Based Reduction from Full- Triangular Form 
to Hessenberg- Triangular Form 

All the operations of the above algorithms are of V-M. Data locality can be further im- 
proved |70] by incorporating M-M operations using the techniques proposed in [80] , which 
we now describe. 

Let denote the Givens rotation that annihilates the {i,j)-th element of A. Note 
that for k > 1, and Gp'*''^'' can commute if > ii + k. This observation allows 
us to regroup the rotators as follows: Divide A into slabs of b columns each. Con- 
sider the slab that spans columns j + 1, . . . , j + b. The rotators formed for the slab are 
GSi+^\ . . . , g54V\ • • • , Gg+^\ . . . , C'^/^f^l^. (See Figure O Here A^ = 13 and 6 = 4. The 
figure shows the rotators generated for the first four columns; J = 0.) If the rotators are 
written as in the figure, they form of a trapezoid of height b and parallel sides of length 
N — j — 2 and N — j — b — 1. Form groups by repeatedly cutting off rhombuses of side 
length b from the right end of the trapezoid. The rhombuses will be limited on the left 



127 



CHAPTER 5 



The Generalised Eigenvalue Problem 





^(1) 






Org 






Crg 


^(1) 






^(2) 


^(2) 




^(2) 






^(2) 


^(2) 


^(2) 


<-^13 






r(3) 


Lrg 






Lrg 






1 *-^13 


^12 


^11 


*-^10 


^(4) 

(jg 




G?) 









Figure 5.1: Sequence of Givens rotations used to reduce the first 4 columns of a 13 x 13 
matrix A. 

by the left boundary of the trapezoid. Each rhombus corresponds to a group. The groups 
of a slab can be applied in the left-to-right order. The aggregate of each group will be a 
matrix of dimensions 2b x 2b at the most. 

The right side rotators can be grouped the same way. 

The algorithm of [70] uses the above grouping, and proceeds as follows: Assume that j 
columns of A and B have been reduced. Partition the matrix pair like this: 



A= ^■+^ 

N-j-l 



Au A12 Ai3 
A22 A23 



B 



Bii B12 

522 



i+1 

N-j-l 



Consider the next b columns (that is, ^22). Reduce the columns one by one. When the 



rotators for each column are found, apply them cumulatively, as in Algorithm 15. 2[ to ^22 
and -B22 but not to A23. After this, accumulate the rotators into groups as described above. 
Next find the right rotators that would reduce the bulges in B just formed. Apply them 
cumulatively to ^22,^23 and B22, but not to Ai2,Ai3 and B12. After this, accumulate 
these rotators into groups. When all the b columns are reduced, aggregate each of the 
left side and right side groups into a. 2b x 2b matrix. Apply the left side aggregates in the 
correct order to A23 from the left. Apply the right side aggregates similarly to A21, A^i and 
B12 from the right using level-3 operations. Now j + b columns of A and B are reduced. 
Proceed with the next slab of b columns. 
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A rotator can be applied on a 26 x 2b matrix kept in row major order in 0{b/B) I/Os. 
A group has at most b"^ rotators. Therefore a group can be aggregated in 0{b^/B) I/Os. 
An aggregate can be applied on the pairs in 0{bN/B) I/Os, if 6 = 0{\/M). 

For 1 < k < [(A^— 2)/6], the fc-thslab produces \{N—{k—l)b)/b'\ aggregates. Therefore, 
the total number of I/Os spent in aggregation is 0{N%/ B) and the total number of I/Os 
spent in applying them is 0{N^ /bB). This is an improvement, but it is the rest of the 
algorithm that dominates the asymptotic I/O complexity. For example, the total cost of 
all left multiplications on B is 0{N^/B), as in Algorithms 15.11 and 15.21 

5.4 Tile Based Reduction from Full- Triangular Form 
to Banded Hessenberg- Triangular Form 

The above algorithms are rich in V-V and V-M operations. To improve the performance, 
it has been proposed that the reduction from F-T form to H-T from be split into two 
steps [n[2| [53H5S| [7n] ■ the first step reducing an F-T pair to a banded Hessenberg- Triangular 
pair (BH-T), and the second step further reducing it to an H-T pair. In this section, we 
discuss a tile based algorithm for the first step. This algorithm is similar to Algorithm 13. 2[ 

5.4.1 An algorithm for t < y/M 

Without loss of generality, assume that is an integer multiple of t. Partition the matrix 
pair (A, B) into N/t x N/t tiles Aij and Bij of size t x t each. Assume that a tile- 
transposition has been performed, and the tiles are available in row major order. See 
Algorithm 15.31 This is an adaptation from [70] . 

As shown in [35], annihilations of the subdiagonal blocks and the fill-in elements can 
be done using various ways, i.e., using Givens rotations or Householder transformations 
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or combination of Householders and Givens rotations. We use Householders to reduce 
sub diagonal blocks. 

The functions "QR-decompose" , "QR-Update", "QR-Multiply-from-left-1", and "QR- 
Multiply-from-right-2" were described in Chapters 2 and 3. We now describe the other 
functions. 

Funct ion RQ- decompose ( A j ) : 

Assume that the tile Bij is divided into horizontal slabs of width k each. Let r — t/k. The 
invocation RQ-decompose finds matrices Yi, . . . ,Yr,T-i, . . . ,Tr such that Rij — Bij{It + 
YiT^Y^) ... (It + YrTjl^Y^) is upper triangular. This is done as follows: 

Read Bij into the main memory. Proceed in r iterations. At the beginning of the s-th 
iteration, the subdiagonal elements of the first {s — l)k rows of Bij are all zero. In the s-th 
iteration, calculate Yg and Tg such that the subdiagonal elements of Bij{It + YgTjYj') in 
the first sk rows are all zero. Update Bij to Bij{It-\-YsTjYj). Store Yg in the subdiagonal 
part of the s-th horizontal slab of B^j. 

When all slabs are processed, arrange the upper triangular part of B^j in column major 
order and lower triangular part in row major order, and write B^ back to the disk. 

The number of I/Os needed is 2t^ /B; Bij is read and written; the Bij and T matrices 
are kept in the main memory till the right updates of the corresponding block columns of 
A and B are finished. 

The number of seeks is even for the one tile case. 
Function RQ-Update(Sjj_i, j) 

Here -Bjj-i and Bjj are upper triangular, the invocation RQ-Update(i?j finds 
a matrix Q such that in BQ^ , every element of the (j, j — l)-th tile is zero, and the 
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Algorithm 5.3. A tile based reduction from F-T form to BH-T form 

Input: Matrices A,BE: M.^^^ , where B is upper triangular, and t, the desired bandwidth 

Output: Orthogonal matrices Q, Z e M^^^ such that [H, T) = (Q'^AZ, Q'^BZ) is a BH-T 
pair. H and T overwrite A and B, respectively. 



for z = 1 to N/t - 1 do 

for J = i + 1 to N/t do 
QR-decompose{Aj j); 

for k = i + 1 to N/t do, QR-Multiply-from-left- l{Aj i, Aj j^) , endfor; 

for k = j to N/t do, QR-Multiply-from-left-l{Aji,Bjk), endfor; 

/* Bjj fills out forming a bulge */ 
endfor; 

for j = N/t to i + 2 step -1 do 

QR-Update{Aj-ii, Aji); 

for = i + 1 to N/t do, QR-Multiply-from-left-2{Aj i, Aj Aj_i k) , endfor; 

for k = j — I to N/t do, QR-Multiply-from-left-2{Aj i, Bj k, Bj^ik), endfor; 

/* -Bjj-i fills out forming a bulge */ 
R Q- decompose{ Bj j-i); 

/* Bjj-i is now upper triangular */ 

for A; = 1 to N/t do, RQ-Multiply-from-right-l(Bj j^i, Ai j^i), endfor; 
for k = I to j — 1 do, RQ-Multiply-from-right-l{Bj j-i, Bij_i) endfor; 
RQ-decompose{Bj j ) ; 

/* Bjj is now upper triangular */ 

for k = I to N/t do, RQ-Multiply-from-right-l{Bj j, Aij) endfor; 
for = 1 to j — 1 do, RQ-Multiply-from-right-l{Bjj,Bij) endfor; 
R Q- Update{Bj j _ i , Bj j ) ; 

/* Now Bjj-i is zero and Bjj is upper triangular */ 

for k = \ to N/t do, RQ-Multiply-from-right-2{Bj j^i, Aij_i, Aij) endfor; 
for k = I to j — I do, RQ-Multiply-from-right-2{Bj j-i, Bij-i, Bij) endfor; 
endfor; 

/* A bulge remains in -Bj+i j+i */ 
RQ-decompose{BiJ^i j+i); 

for = 1 to N/t do, RQ-Multiply-from-right-l{Bi^ii^i, Aii^i) endfor; 

for k = 1 to i do, RQ-Multiply-from-right-l{Bi^ii^i,Bii^i) endfor; 
/* Now B is upper triangular, again */ 
endfor; 
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{j,j)-th tile is upper triangular. This is done over r — t/k iterations. The s-th iteration 
processes Bj^\ the s-th rightmost vertical slab of width k of Bjj. Consider the matrix Cg = 

Bjj^i Bff y Let (^^'^ and be matrices such that (^Ik+t + (^^" ^ Tj { Yj h ) 
has zeroes in the leftmost t columns of its s-th bottom-most horizontal slab. Then, 

O(t-sjt) 



( Bjj-i Bjj ) 



l2t + 



( 0{t-sk) h ^{s-i)k ) 



h 

too zeroes in the leftmost 2t — sk columns of its s-th bottom-most horizontal slab. Set 
( Bjj_i Bjj ) to this matrix. The s-th slab of Bjj^i is now all zero. 

The above can be implemented as follows: Read Bjj^i. Read Bjj one vertical slab at a 
time. For each slab, update the slab, write it back and overwrite the corresponding hori- 
zontal slab of Bjj_i with the Y matrix. The number of I/Os is at most 3t^/B, irrespective 
of whether one tile or two tiles fit in the main memory: one tile and a half are both read 
and written. 

Since both the tiles are upper triangular, with their nonzero elements stored contiguously 
during the RQ-decomposition, the number of I/Os is 2t^ /B: two half tiles are both read 
and written. The Y matrices is kept in-core for the future right updates. 

The number of seeks is 0{t/k). 



Function RQ-Multiply-from-right-l(i?jj, Dkj): 

Invocation of this function after an RQ-decomposition multiplies tile D^j {D here can be 
A or B) from the right with Q^. 

Implementation: Read horizontal slabs of D^j into the main memory, update using the 
in-core Y and T matrices and write back. The number of I/Os is 2t^/B: read D^j and 
write back. 
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The number of seeks is 0{t/k). 
Function RQ-Multiply-from-right-2(Sj Aj-i, Aj): 

Invocation of this function after an RQ-Update over Bjj^i and Bjj updates Dij^i and Dij 
{D can be A or B) from the right using Q^. The computation proceeds as follows: 

Read Dij^i. Read a vertical slab from Dij and the corresponding horizontal slab from 
Bjj^i. Update the vertical slab and -Dzj-i using the horizontal slab, and write the vertical 
slab back. When all slabs have been processed, write Dij^i back. If only one tile fits in 
the main memory, then the number of I/Os is 5t^/B: A j-i a-nd Dij are read and written; 
Bjj-i is read. If two tiles fit in the main memory, then -Bjj-i can be kept in the main 
memory, as I varies, and so the number of I/Os is 4t^/B. 

The number of seeks is 0{t/k) if one or two tiles fit in the main memory; 0(1) if three 
tiles fit in the main memory. 

5.4.1.1 Analysis of I/O and seek complexities 



The I/O complexity of the one tile implementation is ^^'^^^ + ^ ( )■ "^^^ com- 



If four or more tiles fit in the main memory, then the two single tile RQ-decompositions 
can be avoided. Instead of invoking RQ-decompose on Bjj^i and Bjj separately and then 
applying RQ-Update on them together, we can RQ-decompose ( -Bjj-i Bjj ) directly. The 
resultant Y and T matrices can be kept in the main memory and while updating the 
{j — l)-th and j-th tile columns. This saves a httle on the cost. Note that in each i-loop, 
there will be one single tile RQ-decompose; that of Sj+ij+i. 



If one or two tiles fit in the main memory, the seek complexity is O If three or 

more tiles fit, the seek complexity is O ( ^ J . 




plexities of the two and three tile implementations are 



9.833Ar3 
tB 
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As discussed in earlier chapters, the single, two and three tile implementations assume 
that + Akt < M, 2t^ + d,kt < M, and St^ + 2kt < M respectively. The parameter k is 
usually kept small. 

The number of I/Os required to handle the T matrices is O ^^^^^^^j; as in the other 
tile based algorithms. 

5.4.2 Reduction to the Banded Hessenberg- Triangular form when 
p > 2 tiles fit in the main memory 

We now briefly describe an algorithm that is applicable when p > 2 tiles fit in the main 
memory (SHEO], and then analyse it for its I/O and seek complexities. 

Let {A, B) be an F-T matrix pair. Let ^(i,p) denote the submatrix of A formed by the 
p bottom-most tiles of its i-th tile column. QR-decompose A(i,p) into Qi-Ri- Multiply the 
bottom- most p tile rows of A and B with from the left. A pt x pt bulge forms in the 
bottom-right corner of B. Let B denote the submatrix of the bulge formed by excluding 
its topmost tile row. 



Suppose R2Q2 is an RQ-decomposition of B. Then BQ2 = R2- Let Q2 = {Qj Q 



where Q3 comprises of the first tile row of Q2- QR-factorise into, say, (55(-D^ ... 0)"^, 
where D G M*^* is a diagonal matrix in which each diagonal element is ±L Note that 
here Q5 is a product of t Householder transformations, and hence can be represented as 
Q^ = I + where T5 G M*^* and K, G M^*^*. 



BQ5 





V J 





V / 



D-i = BQlD-^ = BQl 







V 







This implies that the leftmost tile column of B can be annihilated by applying to 
the last p tile columns of B. The rest of the bulge is not annihilated; it stays on as a 
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harmless bulges. Now the bottom most p — 1 tiles of the first tile column of A are done. 
Reduce the next set of p — 1 tiles. This would first introduce a pt xpt bulge adjacent along 
the diagonal to the harmless one left behind earlier; the first column of this bulge again 
can be zeroed. Like this when the whole of the first column of A is reduced a sequence of 
harmless bulge would have appeared along the diagonal of B. These are called harmless 
because when the next tile column of A is reduced, the bulges created will subsume these. 
When all the tile columns of A are reduced and A becomes banded Hessenberg, B would 
have become triangular again. 

5.4.2.1 Analysis of I/O and seek complexities 

The number of flops required by the algorithm is 0{N^) [70]. 

Let us assume that the main memory can accommodate two panels of size ptxt together 
with one and a half tiles (to store T and temporary values); i.e. M > {2p + 3/2)t^. 

To RQ factorise a. {p — l)t x pt matrix, for each tile row, starting from the bottom-most, 
reduce it and, keeping the Y and T matrices in the main memory, update the other tile 
rows, until the submatrix that remains to be reduced fits in the main memory. Then it 
is read, RQ factorised in-core and written back; let this happen at the l-th tile row; then 
M = e{{p - l){p - / + 1) + (p - / + 1) + 3/2) or / = + 1) ± v/M/t2 - 3/2. Once the 
Y and T matrices are computed, the first t rows of the Q matrix can be computed easily 
without constructing the Q matrix explicitly. The number of I/Os needed by all this is: 

'{p-i)tx{p-{i-l))t t X {p - {i ~ l))t\ (M + tx{p-{l-l))t 



i=l 



(p) = ^ 2 ( > ~ _^ L ?^ yp- yt- i))L \ 



where / = (p + 1) ± ^/M/t^ - 3/2. 

Similarly, the RQ factorisation of a x pt matrix and construction first t rows of the 
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Q matrix require y{p) I/Os; 



i-i 



where Z = (2p + 3) ± ^J^Mlf^ - 5. 

The QR factorisations and the subsequent left updates require 



i=l \ j=l \ fc=i+l l=a I I i=l j=b \ k=i+l l=b 

I/Os, whereZ= [ ^^^1'"^ ], a = i + 1 + (j - - 1) and 6 = i + 1 + (Z - l)(p - 1). Here 
the first term corresponds the QR factorisations oi pt x t submatrices and the subsequent 
left updates. If the number of tiles N/t — i — 1 to be reduced in i-th tile column of A is 
not an integer multiple of p — 1, then there would a QR factorisation involving p — 1 or 
fewer tiles in that column. The second term stands for those QR factorisations and the 
left updates. 

The I/O cost of RQ decompositions and the subsequent QR decompositions and right 
updates is: 

N/t-p I Z-1 \ N/t-pN/t-b N/t-1 

E h/(p)+E^(p) + E E E 

i=l \ i = l ) i=l fc=l i = JV/i-p + l 

„ N/t-1 / i Z-2 (N/t a \\ 2 '^/'-l^/* / '^Z' \ 

+^ E E^p+E E^p+E^p +^ E E h+E2+E2 

1=1 \m=\ j = \ \k = \ l = \ j j i=l j = b \ k=l 1 = 1 J 

where Z = l ^^^l^'^ ], a^i + l + (j- l)(p - 1) and 6 = i + 1 + (Z - - 1). 

Thus the overall I/O complexity is 0{N^/tB). The seek complexity can be similarly 

shown to beO(^ + 2^). 

If pt X pt + t X pt + 31"^ /2 < M, then the RQ factorisations can be performed in-core. The 
I/O complexity of the QR factorisations and the left updates remain same as the above. 
But the RQ factorisations and the corresponding right updates cost 

N/t-p / z-2 \ N/t-p N/t-b 

R E ((P*)' + P*') + E 2((P - l)Pt' + t X pt) + - E E 2 ((c - l)ct2 + ct^)) 
i=l \ j=l / i=l fc=l 

"^^^ , (t(jV/t-0)^+t^(iV/t-i) 

z=iV/t-p+l 
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2 JV/t-l / i Z-2 /N/t a \\ ^2 N/t-lN/t / N/t j-1 

+B ^ E2P+E E2P+E2P E E 2 + E2 + E2 

1=1 \?n = l j = l \k = l 1 = 1 I j i = l j = 6 \ A: = l i = l 

I/Os where Z = [^^^^Ei^l, a = z + 1 + (j - - 1), c = A^/t - 6 - (A; - 2) and 
6 = z + l + (Z- 

The asymptotic I/O and seek complexities remain the same as before. 

Clearly it is better to do the RQ factorisations in-core. So the parameter p should be 
chosen accordingly. 

5.4.3 An improvement 

We suggest the following improvement over the above. Partition the matrix pair into 
(p — l)t X t sized rectangular tiles instead oftxt square tiles. Now run the above algorithm. 
Assume that each bulges formed fits in the main memory, and leaves room for the Y and 
T matrices of the QR factorisation and for temporary data. That is, M > p{p — 1)^2 -f 
2{p — l)t2 + Reading or writing of pt"^ elements here does not take p seeks as before, 
but only one seek. 

The I/O complexity of the algorithm is same as the previous algorithm. But the seek 
time complexity improves to O + 7?^- 



5.4.4 The case of t > VM 

Clearly, the I/Os required would be O ^ Jj^^ ^ ] run the above algorithm as it is. 

5.5 Reduction from Banded Hessenberg- Triangular Form 
to Hessenberg- Triangular Form 

Unblocked and blocked algorithms using Givens rotations are discussed in [371 [70]. We 
study these first. 
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5.5.1 Unblocked algorithms using Givens rotations 

Let [A, B) be the input BH-T pair; let t be the band width of A. First eliminate element 
{t + 1, 1) of A by applying a Givens rotation from the left on rows t and t + 1 of A and B. 
This creates a bulge at position {t + l,t) of B. Eliminate this bulge by applying a Givens 
rotation from the right on columns t and t + 1 of A and B. This forms a bulge at position 
{2t + 1, t) of A. Continue to chase this bulge out along the (t + l)-st subdiagonal of A and 
the subdiagonal of B. Now we are done with the processing of y4[)f: + 1, 1]. Repeat this with 
A[t, 1], . . . , A[3, 1], and the first column would be done. Repeat for the other columns. 

If the matrix pair is stored in row major order and B < N, then accessing any k 
consecutive elements of a row would require k I/Os. So the I/O complexity of the algorithm 
is, approximately 0{N^). 

The above algorithm accesses each row and column of A and B twice. This is because, 
each rotation is applied as soon as it is formed. [70] presents an algorithm that delays the 
applying of the rotators. In the first iteration of this algorithm, the unwanted elements of 
the first column are ehminated by applying a sequence Gtt+i, ■ ■ ■ ,G23 of Givens rotations 
from the left, where Gj j+i applies over rows i and This introduces nonzero subdiagonal 
elements at positions (t + 1, t), . . . , (3, 2) of B. 

Suppose the sequence Gt+it, Gtt-i, • • • , G32 of Givens rotations eliminate the unwanted 
elements of B when applied from the right. If these rotations were applied together to 
A, then a dense t x t bulge would form below the t-th subdiagonal of A. So, instead, 
apply the rotations one-by-one; the bulge caused by one is to be chased a distance of t 
before applying the next. For example, first apply Gt+it to A creating bulge at position 
{2t + l,t). Chase this bulge by applying only on A a rotation G2t2t+i from the left. Then 
apply the next right rotation Gtt-i creating a bulge at position {2t,t — 1). Chase this using 
a rotation G2t-i2t from left. Apply the remaining right rotations in this manner, forming a 
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Figure 5.2: Reduction of first column of BH-T pair to H-T pair 

sequence G2t2t+i, G2t-i2t, ■ ■ ■ , Gt+2t+3 of left rotations; apply this sequence together on B 
now. This shifts the subdiagonal bulges of B over t columns and rows towards the bottom 
right corner. Repeat this process till the bulges are all chased out the matrices. (These 
steps are illustrated in Figure [5^ ) Now the first column is reduced. Repeat this process 
for the other columns. 

If the matrices are in column major order, then an accumulated set of t Givens rotations 
can be applied on them from the left in 0{Nt/B) I/Os and 0{N) seeks. Forming of the 
right rotations for annihilating the consequent bulges takes 0{t^ / B) I/Os and 0{t) seeks. 
Applying of these rotations from the right and forming the left rotations for annihilating 
the consequent bulges takes 0{Nt/B) I/Os and 0{t) seeks. Therefore, algorithm runs in 
0{N^/B) I/Os and 0{N^/t) seeks. 
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5.5.2 Block algorithms using Givens rotations 

A blocked algorithm for BH-T pair to H-T pair reduction is presented in [37], which we 
now briefly describe and analyse. 

To reduce the i-th column of A, the submatrix pair {A[l : N; i + 1 : N], B[l : 
N; i + is partitioned into s = \{N — i)/t\ block columns consisting of t columns 

each. (The last block column may have fewer columns.) Let (A^, Bk) denote the fc-th block 
column pair. Ak is further divided into A^['^ made up of rows 1 to kt + i oi Ak and A^*"* 
made up of rows kt + i + 1 to {k + l)t + i of A^. Bj. is further divided into B^^^ made up 
of rows 1 to (A; — l)t + i oi B^ and -B^*"* made up of rows [k — l)t + i + 1 to kt + i oi Bj.. 
The block partitioning and subsequent reference pattern to reduce the first column of A is 
illustrated in Figure 15.31 

On the first column of A, the algorithm proceeds as follows. Find a set rowi of Givens 
rotators to annihilate the unwanted elements of the first column of A using the block 
marked (1) in the figure. Apply rowi on 5f ((2) in the fi gure). This introduces t — 1 
single element bulges in Bf\ Generate a set coli of rotators to annihilate those bulges. 
Apply coll to b[^^ ((3) in the figure). Apply both coli and rowi to A^{^ ((4) in the figure). 
Apply coll to Af^ ((5) in the figure) introducing t — 1 single element bulges in it. Generate 
a set row2 of rotators to annihilate these bulges. Apply row2 to i?2*^ ((6) in the figure), 
introducing bulges in it and generating a rotator set C0I2 for them. Apply co/2 to B2^ ((7) 
in the figure), rowi and row2 together with coli to A2^ ((8) in the figure), and C0I2 to Ag*"* 
((9) in the figure) introducing bulges and generating a rotator set row^ for them. Continue 
like this until the bulges generated by the first column are chased out of the matrices, which 
completes the processing of the first column of A. Repeat for the remaining columns. 

Assume that the matrix pair is available in column major order. To reduce any column 
i, we need to access the column panels to its right and update those using sets of rotations, 
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Figure 5.3: Block partitioning and reference pattern to reduce 1st column of BH-T pair to 
H-T pair using Givens rotations 

both from the left and right, introducing bulges and chasing them. So the number of blocks 
of data moved in the processing of the i-th column is 0{N{N — i)/B). Therefore, the total 
number of blocks moved in the algorithm is 0{N^ / B). Similarly, the total number of seeks 
is O(A^Vt). 

This procedure can be extended so that p columns are reduced and the resultant bulges 
are chased together in a super-sweep [37j. Let rowl and coll respectively denote the rotator 
sets for the z-th block row and column obtained while reducing column j. The algorithm 
proceeds in two parts. In the first part (called the reduce-part of the super sweep), the row 
and column sets 'row{.p_,jj^^ and col{.p_j for j = 1, . . . ,p are generated. After reducing p 
columns, in the second part (called the chase-part of the super-sweep), the necessary block 
column updates are performed. The chasing of the bulges and the subsequent updates 
of block columns are performed iteratively one block column ahead in a pipelined fashion 
starting with the leading block column. Asymptotic I/O and seek complexities remain 
same as above. 
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5.5.3 A new Algorithm 

The above algorithms do asymptotically no better than the direct Hessenberg- Triangular 
reduction. The main drawback of the above algorithm is that to reduce a single column of 
A, all larger numbered columns in A and B are read in and written back. 

We propose a Householder based algorithm that mitigates this drawback. Our algorithm 
reduces the columns using Householders. Each left application of a Householder forms a 
{t X t) bulge below the subdiagonal of B. The first column of the bulge is chased using an 
opposite Householder, which when applied from the right, forms (t x t) bulge below the 
band of A. A bulge in B and the subsequent bulge in A together form, what we call, a 
bulge pair. 

We make the observation that it is enough to chase the bulge pairs a short distance 
to make the next few columns ready for processing. We maintain a box pair in A and 
B about the column that is being reduced, and chase the bulge pair to the bottom right 
corner of the box pair. When a number of columns are reduced we would have lined up an 
equal number of bulge pairs at the bottom right corner of the box pair in a tightly packed 
sequence. While doing this, we need to update with the left and right Householders only 
the portion inside the box pairs. After lining up a number of bulges in A, we use the 
aggregate of the left and right Householders to update the portions outside the box pair 
at one go. 

On a banded upper Hessenberg- Triangular matrix pair A and B in M.^^^ , where A is 
of bandwidth t, the algorithm proceeds as follows: 

Divide the matrices into slabs of t columns each. Process the slabs one by one. Begin 
with the leftmost slab. 

Let w be — t. Define the "small box" pair as the intersections of columns / = 1 to 
r — 1 + w (left and right) and rows a — 2tob — 1 + w + t (above and below) of A and B. 
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A small box has w + 1 columns and w + t rows. For h > 1, let the h-th large box pair be 
made up of the intersections of columns l{h) = 1 + t + {h — l)k to r{h) = 1 + w + hk and 
rows a{h) = i + 2t + {h — l)k to b{h) = l + w + t + hkofA and B. Thus, a large box has 
k + w — t + 1 columns and rows, {k is a parameter to be chosen later.) 

Process the columns of the first slab of A one by one. Begin with the first column. 
Find a Householder gj^^ s.t. Q^^ * A[2 : 1 + t; 1] is of the form (* ■ ■ -0)^. Muhiply 
A[2 : 1 + t, 1 : r] and B[2 : 1 + t, 2 : r] from the left with Q^^K A {t x t) bulge forms 
in columns 2 : 1 + t at rows 2 : 1 + t oi B. Chase the leftmost column of this bulge 
in B, i.e., B[2 : t + 1, 2]: Construct a Householder Vi^^ using the opposite Householder 
technique [371|69l[711[in] so that B[2 : t + 1, 2] * v/^^ is of the form (* ■ ■ ■ 0)^. 

(The opposite Householder technique: Given X G M^^^, find an RQ decomposition 
X = RQ of X. Then XQ^ = R. Let be the first row of Q. Find a Householder Q such 
that Qq = sei, where s = ±1 and ei = (1 ... 0)"^. Then {XQ)ei = s{XQ){sei) = 
sX{sQei) = sXq = sX{Q'^ei) = s{XQ^)ei = sRei = jei, for a scalar 7. That is, the 
first column of XQ is of the form 761.) 

Multiply B[a : b] 2 : 1 + t] and A[a : b; 2 : 1 + 1] from the right with vl^\ Note that only 
the portion inside the small box pair is involved in these multiplications. A bulge forms in 
columns 2 : 1 + t at rows 2 + t : 1 + 2t of A. 

Chase the leftmost column of this bulge: Find a Householder Q'^^ s.t. Q^^ * A[2 + t : 
l+2t, 2] is of the form (* ■ ■ ■ 0)^. Muhiply A[2+t : l+2t; 2 : r] and A[2+t : l+2t; 2+t : r] 
from the left with A bulge forms in columns 2 + 1 : 1 + 2t at rows 2 + t : 1 + 2t oi B. 

Chase the leftmost column of this new bulge similarly, and continue like that till the 
right boundaries of the latest bulge in A and the small box coincide. This happens after 
the {2t — l)-th Householder. Nail the bulge in its present position. 

The nailed bulge in A is unreduced and hence spans t columns, whereas each of the 
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other (harmless) bulges has lost one column and so spans t — 1 columns. 

Now process the second column of A, and chase the bulge pairs similarly, until the 
latest bulge in A is i — 1 columns away from the nailed bulge of the first column. This 
involves finding of (2t — 3) left and right Householders. Note that the bulges created in A 
and B while processing the second column subsume, when they are as yet unreduced, the 
harmless bulges left behind by the processing of the first column. 

Process the t columns of the first slab of A like this. When all are done, we will be left 
with t nailed bulges spanning columns t + 1 to 2t^ — t + 1 in A, and any two consecutive 
nailed bulges will be separated by a harmless bulge that spans t — 1 columns. B has only 
harmless bulges, and these span columns t + 2 to 2t^ — 2t. 

Note that the slab and the interspersed sequence of the nailed and harmless bulges are 
tightly packed inside the small box pair of A and B. 

The left and right Householders found till now have been applied only to the relevant 
portions of the small box pair. Now we form groups of these left and right Householders, 
aggregate each group into an (7 + YTY"^) representation and apply the left Householder 
groups from the left on the portion to the right of the small box pair, and apply the right 
Householder groups from the right on the portion above the small box pair. (The details 
of the aggregation are given later.) 

Next consider the first large box pair. The nailed bulges of the small box of A now 
occupy the top left corner of this box. Recall that the small box of B has only harmless 
bulges. Move the nailed bulges of A into the bottom right corner of the first large box 
of A, one by one, starting with the rightmost bulge, by chasing them with left and right 
Householders as before and arrange them there in a similar tightly packed manner. This 
process would leave harmless bulges each spanning t — 1 columns in the small boxes of A 
and B. In particular, columns t + 2 to 2t oi A and B hold the harmless remnant of the 
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bulge introduced by the reduction of column t. But column t + 1 of A is now without a 
bulge, and so is ready for processing. The left and right Householders found during this 
process are applied on the fly only to the relevant portions of the first large box pair. 
Now group the left Householders, aggregate the groups and apply the aggregates on the 
portions of A and B to the right of the first large boxes from the left. Apply the right 
Householders on the portions of A and B above the first large boxes likewise. (The details 
of the aggregation are given later.) 

We repeat the process for the remaining large boxes in a left to right order, until the 
nailed bulges in A are all within the last k + w columns of the matrix. 

Next we chase the bulges from the bottom right corner of the last large box, and out 
of the matrix. We employe the same technique that shifts the tightly packed bulge pairs 
from the bottom right corner of one large box to the bottom right corner of the next. The 
only difference here is, instead of chasing each bulge pair exactly k/t times, each bulge 
pair is chased till it is out of the matrix pair. For example, if x chases are required for the 
rightmost bulge pair, chase the second bulge pair {x + 2) times, the third bulge pair (x + 4) 
times, and so on. Householders into groups and apply them to the top of the last box pair. 

Now the first slab of A has been processed. Continue with remaining slabs, until at most 
k + w — t + 1 rightmost columns of the matrix A remain to be reduced. Then reduce the 
remaining columns one by one, completely chasing, for each, the bulges out of the matrix, 
and perform the necessary updates. 

Now the matrix pair will be in Hessenberg- Triangular form. 

A more formal description of the algorithm follows: 
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Algorithm 5.4. BH-T pair to H-T pair using tightly coupled bulge chasing 

Input: BH-T pair A,B e R^""^. 

Output: A is overwritten with Hessenberg matrix and B is overwritten with triangular 

matrix. 

Let w^2t^ -t 

for (i = l-i < \{N -k-w-t)/t'];i^i + t) do 

Let the small box be the intersection of columns l{i) = i to r{i) = i + w and 
rows a{i) — i to b{i) — i + w + t. 
Thus, it has w + 1 columns and w -\-t rows. 

for {j — i;j < i + t — l;j + +) do /* process the column */ 

Initialise Gi . . . G2(t-i)+i o-nd Fi . . . F2(t_i)+i to identity matrices of appropriate 
dimensions. 

Find Householder Q^f^ s.t. Q^^^ * A\j + 1 : j + j] has form (* • • • 0)^. 
Aggregate Q^p into Gi. 

Multiply A[j + 1 : j + t, j : r{i)] with Qi^ from the left. 

Multiply B\j + 1: j + t, j + 1: r(i)] with Q^^^ from the left. 

A bulge forms in columns j + 1 : j + t and rows j + 1 : j -\-t of matrix B 

Find an opposite Householder V-^''^ such that the first column of 

B\j + 1 : j + t; j + 1 : j + t]* v}^^ has form (* • • • 0)^. 
Aggregate V-^-'^ into Fi. 

Multiply B[a{i) : j + t, j + 1 : j + t] with Vl^'^ from the right. 
Multiply A[a{i) : j + 2t, j + 1 : j + t] with vj^^ from the right. 
A bulge forms in columns j + 1 : j + t and rows j -\-t + 1 : j + 2t of matrix A 
Let Xj = 2 * (t — (j — i + 1)) + 1; Then Xj — 1 is the number of times the bulge pair 
will be chased before it 's fixed in a small box. 
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for {z = 2;z < Xj; z + +) do 

Find gi^^ s.t. Q'-P * A[j + {z-l)t + l: j + zt, j + {z - 2)t + 1] has form (* ■ ■ ■ 0)^. 
Aggregate qP into G^- 

Multiply A[i + (z - l)t + 1 : j + zt; j + {z-2)t + l: r{i)] from the left with qP 

Multiply B[j + {z - l)t + 1 : j + zt; j + {z - 2)t + 1 : r{i)] from the left with qP 

A bulge forms in columns j + {z — l)t + 1 : j + zt and 
rows j + {z ~ l)t + 1 : j + zt of B. See remark below 
Find an opposite Householder vj''^ such that the first column of 

B[j + {z-l)t + l: j + zt; j + {z - l)t + 1 : j + zt] * v}^^ has form (* ■ ■ ■ 0)'^. 

Aggregate Vz''^ into F^. 

Multiply B[a{i) : j + zt; j + {z — l)t + 1 : j + zt] from the right with v}"''^ 
Multiply A[a{i) : j + {z + l)t; j + {z - l)t + 1 : j + zt] from the right with vj^^ 
A bulge forms in columns j + {z — l)t + 1 : j + zt and 
rows j + zt + 1 : j + {z + l)t of A. See remark below 
endfor /* Nail the bulge at the present position */ 
endfor 

Multiply A\a{i) : h{i) - 1, r(i) + 1 : A^] and B[a(i) : b{i) - t, r{i) + 1 : N] 

(right of the small box) from the left with {Gi ■ ■ ■ G2(t-i)+i)'^ 
Multiply B[l : a{i) - 1, l{i) : r(z)] and A[l : a{i) - 1, l{i) : r(i)] 

(above the small box) from the right with {F2(^t-i)+i ■ ■ ■ -^i) 
5*66 Algorithm \5.5\ 

Let the h-th large box be the intersection of 

columns l{i, h) = i + t + {h — l)k to r{i, h) = i + w + hk and 

rows a{i, h) = i + 2t + {h ~ l)k to b{i, h) = i + w + t + hk. 

Thus, it has k + w — t + 1 columns and rows. 

k is a parameter to be chosen later. 
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for {h — 1; r{i, h) < N — t; h++) do /* move the bulges to the h-th large box */ 
for {j = i;j < i + t — 1; j ++) do 

Initialise Ghk/t+2t-i ■ ■ ■ G(^h-i)k/t+2 and Fhk/t+2t-i ■ ■ ■ F(h-i)k/t+2 to identity matrices 

of appropriate dimensions. 
Xj=2*{t-{j-t + 1)) + 1; 
for {p— l;p < k/t;p++) do 

Let z — p + {h — l)k/t + Xj. /* for each column j we pick up the 
z- count from where the previous box (large or small) left off */ 

Find Q^i^ s.t. Q^P * A\j + {z - l)t + 1 : j + zt; j + {z - 2)t + 1] 
has form (* . . . 0)^. Aggregate with G^. 

Multiply A[j + {z-l)t + l:j + zt; j + (z-2)t + l: r(i, h)] and 

B[j + {z - l)t + l : j + zt; j + {z - l)t + 1 : r{i, h)] from the left with . 

A bulge forms in columns j + {z — l)t + 1 : j + zt and 

rows j + {z — l)t + 1 : j + zt in B. 

Find an opposite Householder such that the first column of 

B[j + {z-l)t + l: j + zt; j + {z - l)t + 1 : j + zt] * v}^^ has form (* • • • 0)^. 

Aggregate Vz''^ into F^. 

Multiply B[a{i, h) : j + zt, j + {z — l)t + 1 : j + zt] and 
A[a{i, h) : j + {z+ l)t; j + {z - l)t + 1 : j + zt] from the right with Vz^^^ . 
A bulge forms in columns j + {z — l)t + 1 : j + zt and 
rows j + zt + 1 : j + {z + l)t in A. 
endfor 
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endfor 

The smallest and largest values assigned to z in the above are 
{h — l)k/t + 2 and hk/t + 2t — 1, when (p = 1, j = i + t — 1) and 
(p = k/t,j = i) respectively. See Alaorithm \5.6[ 

Multiply A[a{i, h) : b{i, h) - t; r{i, h) + 1 : N] and B[a{i, h) : h) - t; r{i, h) + 1 : N] 
(right of the h-th large box) from the left with {G(^h-i)k/t+2 ■ ■ ■ Ghk/t+2t-iY 

Multiply A\i : a(i, h) — 1; /(i, h) : r(i, h)] and B\i : a(i, h) — 1; /(i, h) : r(i, h)] 
(above the h-th large box) from the right with {Fhk/t+2t-i ■ ■ ■ F[h-i)k/t+2) 

See Algorithm \5.(k 
endfor 

Chase the tightly packed bulge pairs that are now at the bottom right corner of the 
{h — l)-th large box, one after another, out of the matrix pair. 

Aggregate the right Householders and apply to the corresponding columns of A and B. 
endfor 

Reduce the submatrix pair A[N + 1 - : A^; N + l- : A^] and. 

B[N+l- -.N- N+l - \ i^-^-^+') ^t : N]; 

Aggregate the right Householders and apply to the the submatrix pair 

A[l:N~ N + 2- : N] and 

B[l:N- : N- N + 2- : A^] 

Remark: The bulge that forms in columns j + {z — l)t + l : j + zt and rows j + zt + 1 : 
j + [z + l)t of A subsumes any harmless elements that may have been present there. 
Columns j + {z — l)t + l : j + zt cannot have elements below this bulge; column j + zt has 
been cleared of any bulge before proceeding to this step. Similarly, the bulge that forms 
in columns j + {z — l)t + 1 : j + zt and rows j + {z — l)t + 1 : j + zt of B subsumes any 
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Figure 5.4: Introduction of t = 3 tightly packed bulge pairs after reducing first 3 columns 
of BH-T pair. 

harmless elements that may have been present there. 

A snapshot of the matrix after applying the bulge introduction phase to the first slab 
{t = 3) is shown in Figure EH In the figure, the submatrices + 1 : 2 + 2t^, i : i + 2t'^—t] = 
A[2 : 19, 1 : IQ] and B[i + 1 : i + 2t^ -t, + 2t^ -t] = B[2 : 16, 1 : 16] are updated from 
both the left and right, the submatrices A[i + l : i + 2t^-t, i+w + 1 : A^] = A[2 : 16, 17 : N] 
and B[i + 1 : i + 2t'^-t, i + w + 1 : A^] = B[2 : 16, 17 : A^] are updated from the left and the 
submatrices A[l : i, i + 1 : i+w] = v4[l : 1, 2 : 16] and B[l : i, i + 1 : i+w] = B[l : 1, 2 : 16] 
are updated from the right. 

5.5.3.1 Aggregation: the bulge introduction phase 

Let Qi ^ be the left Householder that reduces the j-th column of A, and for z > 1, let Qi^^ 
be the left Householder that reduces the first column of the z-th bulge induced by j-th 
column in A. Q^P acts along dimensions j + 1 + {z — l)t to j + zt. 

Consider the processing of the slab beginning at column i of A. 
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The Householders are constructed and applied inside the small box pair in the following 
order: • . . QxiQi'^^^ ■ ■ ■ ■ ■ ■ Qi^^ from the left. But while updating the portions 

to the right of the small boxes they need not be applied in the same order. We can 
aggregate them into different groups using a technique proposed in [78], because of the 
following observations: (See Figure 15751 and Figure WM ) 

(i) 

1. Ql s, which reduce the j-th column and the bulges induced by it, act along disjoint 
dimensions. That means, if Qi^^s are to be applied on a submatrix only from one 
side (left or right), then they can be applied in any order. 

2. The dimensions of Qi"'^ and Qy~^^ overlap iS y = z or y = z + 1. Therefore, Q^P can 
be applied only after Qz and Q^J+i^ ■ 

3. Qz s are to be applied in the order of their construction. Let Gz be the aggregation 
of Q^z^^s in the order of construction. 

4. If the Gz^s are applied in the decreasing order of z, then the above two requirements 
will be met. 



The computing of (Gi . . . G2(t-i)+i)'^ A[a{i) : b{i)—t, r{i) + l : A^] and (Gi . . . ^2(4-1)+!)"^ 
B[a{i) : b{i)—t, r{i) + l : A^] can be done as follows. Note that Xi = 2(t — 1) + 1, Gxi = Q^x} 
and Gxi-i = Q^Xi-i- ^ slab only the first column is subjected to more than Xi — 2 chases. 
Gx^-2 is an aggregation of Qt}-2 and Q^Xi-2- is an aggregation of Q^x^-^^ and Q^i^Ls. 

In a slab only the first two columns are subjected to more than Xi — 4 chases. Continuing 
like this, we find that ioi 1 < z < Xi, if q = Xi — z + 1, then 1 + [(g — l)/2j is the number 
of Householders aggregated into Gz, which therefore operates on y = t + [{q — 1)/2J rows. 
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The right Householders V^'''s satisfy all the criteria discussed above in the context of the 
left Householders qI^'^'s. Therefore, aggregate them into groups Fi . . . -F2(t-i)+i; and com- 
puteA[l : a(i)-l; : r(i)](F2(t_i)+i . . . Fi) and S[l : a(i)-l; : r{i)]{F2it-i)+i ■ ■ ■ F^) 
in the same manner. 

The left and right applications of the aggregated Householders to submatrix pairs can 
be done as follows: 

Algorithm 5.5. Multiplying with the Left and Right Aggregates from Bulge Introduction 

Input: Aggregations of the {2t — 1) groups of the left and right Householders. 
Output: Updated matrices. 

for [z — Xi] z>l] z ) do 

q ^ Xi - z + I; 

y = t+\_{q-l)/2\; 

Multiply A[i + {z-l)t + l:i + {z- l)t + y; r{i) + 1 : A^] and 

B[i + {z -l)t + l:i + {z - l)t + y] r{i) + 1 : A^] from the left with 

Multiply A[l : a{i) - 1; i + {z - l)t + 1 : i + {z - l)t + y] and 

B[l : a{i) — 1; i + {z — l)t + 1 : i + {z — l)t + y] from right with 
endfor 
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Figure 5.5: Left Householders constructed to introduce t = 3 tightly packed bulge pairs 
after reducing first block(first 3) column. 

5.5.3.2 Aggregation: the bulge chasing phase 

In the bulge chasing phase of the algorithm, repeatedly, t bulges that are nailed at the top 
left corner of a large box in a tightly packed sequence in A, are chased one by one over a 
distance of k. This would leave them at the bottom right corner of the large box, which 
is also the top left corner of the next large box. In the corresponding large box of B the 
harmless bulges will lie below the diagonal. Each bulge pair is, thus, chased k/t times from 
its current position. See Figure 15.61 

During the chasing steps, the left and right Householders found are applied on the fly 
only to the relevant portions of the box pair. Afterwards, we group the Householders, 
aggregate the groups and apply the aggregates on portions outside the box pair. 

Consider the processing of the slab beginning at column i of A. Suppose we are at the 
h-th large box of this slab. Let a = Xi + {h — l)k/t + 1. 

The left Householders are constructed and applied inside the box pair in the following 
order: Q^'^ . . . ■ ■ ■ Qttl]t-3 ■ ■ ■ Q^-lt+l ■ ■ ■ Qt-lt+k/t+v Similarly the corre- 
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Figure 5.6: Chasing of 3 tightly packed bulge pairs during stage-2 of BH-T to H-T reduction 
sponding right Householders are constructed and applied inside the box pair in the follow- 



ing order- V^'^ V^'^ 1/^*"^^^ I/^*"*"^^ V 

mg Oiaei. Va ... Va+k/t-l^a-2 ■ ■ ■ Va+k/t-3 ■ ■ ■ r 



{i+t-l) 



'a-2t+2 ■■■ ^a-2t+k/t+r But while updating 
the relevant portions outside the box pair they need not be applied in the same order. The 
observations made during the bulge introduction phase hold for these left and right House- 
holders too. Hence these left and right Householders too can be grouped separately and 
aggregated the same way. That is, all Q's with the same subscript can be grouped to- 
gether, and aggregated in the order in which they are constructed. Similarly, all V^s with 
the same subscript can be grouped together, and aggregated in the order in which they are 
constructed. Note that the smallest and largest subscripts are a — 2t + 2 and a + k/t — 1 
respectively. See Figure I^TTl 

For a fixed as j varies over [i, z + t — 1], and as p varies over [1, k/t], 

{h-l)k 



+ 2t\+p-2j + 2i-l = a + D{p,j] 



where D{p, j) = p — 2j + 2z — 1. Note that each z corresponds to a group. When z is fixed. 
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Figure 5.7: Left Householders constructed to chase 3 tightly packed bulge pairs. 



D{p,j) is also fixed, as a is independent of p and j. 
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Consult the above table. For -{2t - 2) < D < k/t - 1, set {{j,p)\D{p, j) = D} 
corresponds to Ga+D- If -D > 0, then the leftmost cell with value D is {i,D + 1). 

Otherwise, the leftmost cell {j,p) is (i + \{—D)/2~\, b), where b = 1 ii D is even, 2 if D is 
odd. If is of value D, then the next cell of value D to the right is (j + l,p + 2). 

Qi^^ and v}^^ operate on range [j + {z - l)t + 1 : j + zt] = at + {j + Dt) + [-t + 1 : 0]. 

For a given z, and D — z — a,\et E{D) he the number of Householders aggregated into 
^^(or F^) . The range of ^^(or F^) then spans E{D)-l+t starting at at+Dt+{-t+l)-^j* , 
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where j* is the smallest row j that contains D. 

We assume that 2t - 2 < k/t. If D < 0, then E{D) = [{D + 2t)/2\. If D > and 
2t-l + D < k/t, then E{D) =t. If D > and 2t- 1 + D > k/t, then E{D) = [i(f-D)]. 

Therefore the updations can be done using the Algorithm 15.61 

Algorithm 5.6. Multiplying with the Aggregates from Bulge Chasing 

Input: The k/t + {2t — 2) aggregations of the left and right Householders 
Output: Updated matrices. 

for {z = hk/t + 2t- 1; z > {h - l)k/t + 2; z ) do 

D = z- - 2t; 

IfD>0, thenj* =i; else f = {i + \{-D)/2]). 
IfD<0, then E = [{D + 2t)/2\ ; 
else if2t-l + D< k/t, then E = t; 
elseE= [i(f -D)]. 

L = zt + j* -t + 1; U = L + t + E - 2; 

Multiply A[L:U] r{{) + l: N] and B[L : U; r(i) + l: N] from the left using G^; 
Multiply A[l : a{i, h) — 1; L : U] and B[l : a{i, h) — 1; L : U] from right using F^; 
endfor 



5.5.3.3 I/O Complexity 

The I/O complexity of this algorithm is at most twice that of Algorithm 14.61 of Chapter 4. 
The analysis is identical and is therefore omitted. 

5.5.3.4 The two step reduction to Hessenberg- Triangular form 

Analogous to our Hessenberg reduction of Chapter 4, the results to the last two sections 
can be combined to get an algorithm for Hessenberg- Triangular reduction with the same 
asymptotic I/O complexity. 
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5.5.4 A new algorithm using rotations 

With a little modification, the algorithm similar to the Algorithm 15.41 can be devised using 
rotations instead of reflectors. The application of rotations from left would form t — 1 
bulges along the subdiagonal of B. Chase all the t — 1 bulges using right rotations. In 
this case there would be no harmless bulges in B. Other steps are similar to the algorithm 
with the same asymptotic I/O and seek complexities. So we omit the detailed description 
and analysis here. 
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6.1 Introduction 

The QR algorithm [5 H [52 | [58 |I11H[TT5|I116] is popular for computing all eigenvalues of a 
matrix. More advantageous algorithms have been proposed for the symmetric eigenvalue 
problem [71|3ll[88], but the QR algorithm continues to be the method of choice for the 
nonsymmetric eigenvalue problem. 

The QZ algorithm [MllHSlinS] and its variants [SlISTliglinilinHlEn] are popular for 
computing all generalised eigenvalues of a matrix pair. 

In this chapter, we study these algorithms, analyse them for their I/O and seek com- 
plexities, and propose their variants that use fewer seeks overall, and fewer I/Os in some 
cases. 

6.1.1 The QR Algorithm 

Given a matrix A G C^^^, the QR algorithm iteratively computes a Schur decomposition 
T = Q*AQ of A, where Q G C^^^ is unitary and T G C^^^ is upper triangular. The 
basic QR algorithm starts with Aq = A and iterates as follows: 
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which generates a sequence of unitarily similar matrices Aq, Ai, A2 . . which under appro- 
priate conditions, converges to an upper triangular matrix [58 |lll5[ril8] . The eigenvalues 
of a triangular matrix are its diagonal elements. An iteration of the algorithm is called a 
QR iteration or QR step. 

When the QR algorithm is applied on a full matrix A, each QR step takes 0{N^) 
flops and O ^ Jj^^ ^ I/Os, as our discussion in Chapter 2 shows. If A is first reduced to 
Hessenberg form using OSTs, and then the QR algorithm is applied to the reduced matrix, 
then a QR step would take only 0{N^) flops; the Hessenberg reduction would need 0{N^) 
flops, but that is done only once. A QR step applied on a Hessenberg matrix produces a 
Hessenberg matrix. 

The explicit (single shift) QR algorithm introduces shifts in the basic QR algorithm to 
accelerate its convergence: 

Here the scalar pm-i is called a shift, and is typically chosen to approximate the eigenvalue 
which emerged at the lower right corner of the matrix. The explicit (double shift) QR 
algorithm is a further improvement, and avoids complex arithmetic of real matrices. 

The explicit variants of the QR algorithm are not used in practice, some bulge chasing 
algorithms called the implicit (shifted) QR algorithms [5Tl[52l|58] are used. In these, an 
iteration starts with an upper Hessenberg matrix, introduces alxlor2x2 bulge in the 
upper left corner of the matrix, and chases it one column at a time along the subdiagonal 
to the bottom right corner. An iteration is completed when the bulge is chased out of the 
matrix restoring it to the upper Hessenberg form. 

If after an iteration of the QR algorithm, a subdiagonal entry of the matrix is very 
near to zero, we could explicitly set it to zero. This would divide (or deflate) the EVP 
into smaller instances, which could then be conquered independently. The subdiagonal 

160 



CHAPTER 6 



The QR and QZ Algorithms 



entries can be monitored either using this classical deflation technique [5T | |52 |[TT8] or the 
more advanced aggressive early deflation technique [26l[76j. The QR algorithm is said be 
converged when all the deflated diagonal blocks are either 1 x 1 or 2 x 2 for real or complex 
conjugate eigenvalues [51 ]l52| [58] . 

In the single or double shift algorithms the bulges are 1 x 1 or 2 x 2, and are typically 
chased using rotators. So these algorithms use only V-V operations. [12j proposes the 
multishift QR algorithm that incorporates V-M and M-M operations using m > 2 shifts. 
The bulges in this algorithm are m x m. These bulges are chased using Householders, 
which can be aggregated before updations to make the algorithm rich in M-M operations. 

However the performance of the multishift algorithm is not found to be good [75 |I112I - 
I114j . This is due to a phenomenon called shift blurring: when the number of shifts is 
large, the eigenvalues of the bulge tend to get contaminated by roundoff errors during 
the bulge chasing process. A technique called the small bulge multishift method has been 
proposed [25l[671[73l[80] to address shift blurring. This method, instead of introducing an 
m X m bulge at one go, constructs a chain of small bulges of size 2x2, where each bulge 
propagates information of only one pair of shifts. The bulges of the chain are chased one 
after another starting from the lowest. 

In this chapter we briefly describe and analyse the small bulge multishift algorithm 
of [25]. We also propose a tile based variant of it that improves the seek complexity by a 
factor of m, and improves the I/O complexity if m is very large. 

6.1.2 The QZ Algorithm 

Given a pair of matrices A,B E C^^^, the QZ algorithm iteratively computes a generalised 
Schur decomposition of {A,B); that is, it computes unitary matrices Q and Z such that 
S = Q*AZ and T = Q*BZ are upper triangular. (If A, B e R^^^, then Q and Z would 
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be orthogonal and S would be quasi-upper triangular. The complex conjugate (resp., real) 
eigenvalues appear in 2 x 2 (resp., 1x1) diagonal blocks of the matrix pair {S,T).) The 
basic (explicit) QZ algorithm (with a single shift) starts with [Aq, Bq) = [A, B) and iterates 
as follows: 

which generates a sequence of equivalent pairs (Aq, -Bq), (^i, Bi), {A2, B2), . . ., which under 
appropriate conditions, converges to a triangular-triangular pair |115j . The eigenvalues of 
a triangular-triangular pair {S,T) are T[i,i]/ S[i,i], for 1 < i < N. An iteration of the 
algorithm is called a QZ iteration or QZ step. The shift is typically chosen to approximate 
the eigenvalue which emerged at the lower right corners of the matrices. 

The algorithm converges in fewer and faster iterations when applied on Hessenberg- 
Triangular (H-T) pairs. A QZ iteration applied on an H-T pair {H, T) brings H closer to 
triangular form while preserving the triangularity of T. Therefore, a pair of matrices is 
usually reduced to an H-T pair before applying the QZ algorithm. Each iteration takes 
0(A^2) flops ^hen. 

The explicit (double shift) QZ algorithm is a further improvement, and avoids complex 
arithmetic of real matrices. 

The explicit variants of the QZ algorithm are not used in practice, some bulge chasing 
algorithms called the implicit (shifted) QZ algorithms are [58l[86]. In these, an iteration 
starts with an upper Hessenberg- Triangular pair {H,T), introduces alxlor2x2 bulge 
in the upper left corner of T, and chases it back and forth between H and T, one column 
at a time to the bottom right corners. An iteration is completed when the bulge is chased 
out of the matrix pair restoring it to the Hessenberg- Triangular form jll7] . 
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If after an iteration of the QZ algorithm, a subdiagonal entry of H is very near to zero, 
we could explicitly set it to zero. This would divide (or deflate) the GEVP into smaller 
instances, which could then be conquered independently. The subdiagonal entries can 
be monitored either using this classical deflation technique [58l|86] or the more advanced 
aggressive early deflation technique [21131169]. 

In the single or double shift unblocked algorithms the bulges are 1 x 1 or 2 x 2, and are 
typically chased using rotators [371 EH [86] . So these algorithms use only V-V operations. 
Blocked versions are also known, where the application of the left and right transformations 
are delayed till required [37]. V-V operations dominate this algorithm too. An implicit 
shifted QZ algorithm using m > 2 shifts is given in [69 ]I117] . in which mxm bulge pairs are 
chased using Householders and opposite Householders from the left and right respectively. 
The multishift QZ algorithms too suffer from shift blurring. 

To mitigate shift blurring, the small bulge multishift variant of the QR algorithm can be 
extended into a QZ algorithm [69]. This method, again, instead of introducing an m x m 
bulge pair at one go, constructs a chain of small bulge pairs (of size 2 x 2 or 4 x 4 in [69]), 
where each bulge pair propagates information of only one pair of shifts. The bulge pairs 
of the chain are chased one after another starting from the lowest. 

In this chapter we briefly describe and analyse the small bulge multishift algorithm 
of [69]. We also propose a tile based variant of it that improves the seek complexity by a 
factor of m, and improves the I/O complexity if m is very large. 

6.1.3 Organisation of this Chapter 

The rest of the chapter is organised as follows: Section 2 discusses the QR algorithm. 
Section 3 is on the QZ algorithm. 
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6.2 The QR Algorithm 

In this section we consider some variants of the QR algorithm. 
6.2.1 The implicit multishift QR algorithm 

The explicit or implicit QR algorithms using one or two shifts chase 1 x 1 or 2 x 2 bulges 
and therefore are rich in V-V operations. [12] proposes the multishift QR algorithm that 
incorporates more V-M and M-M operations using m > 2 shifts. (See Algorithm 16.11 ) 

Algorithm 6.1. The Implicit Multishift QR Algorithm [TB^ 

Input: An N x N properly Hessenberg matrix H and the number of shifts m G [2, N]. 

Output: An orthogonal matrix Q G M.^^^ such that Q^HQ orthogonally similar to H , 
which is overwritten by HQ. 

Notation: Let 'Kj{x) denote a Householder matrix that maps the last N — j elements of 
vector X G to zero without modifying the leading j — 1 elements. 

Determine m shifts pi, . . . , p^; 

x = {H- pJ){H - P2I) ■■■{H- p^I)e,; 

/* Only the topmost m + 1 elements of x can be nonzero */ 

Q = ^i{x); /* Q operates on dimensions 1 to m + 1 */ 

H = Q^HQ; 

for j = 1 to - 2 do 

Q = 'Kj^i^Hcj) ; /* Q operates on dimensions j + 1 to j + m */ 

H = Q^HQ; 

Q = QQ; 
endfor 



Several ways in which the m shifts can be chosen are reported in [T2]. We do not dis- 
cuss them further. Once the shifts are chosen, the first column of 11 = {H — piI){H — 
P2I) ■ ■ ■ {H — pml) can be constructed in 0{rn? / B) I/Os, since 11 has exactly m subdiago- 
nals. 

When H is updated with the first Householder matrix, it becomes an upper Hessenberg 
matrix with an m x m bulge in the leftmost m columns. In the for-loop of the algorithm, H 
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is further reduced to the Hessenberg form using an OST. Instead of using Givens rotations 
as in single and double shift algorithms, here Householders are used to chase the m x m 
bulge. The direct Hessenberg reductions discussed in Chapter 4 could be used here. In 
particular, if Algorithm 14.21 were to be used, the total amount of data moved would be 
0{N'^{m + k)/B), where k is the slab width. 

li {m + k)k < M then the number of seeks is O {~^^^ + ^ m"^^'' ^ ■ 

6.2.2 A small-bulge variant of the implicit multishift QR algo- 
rithm 

As was mentioned in Subsection 16.1.11 because of shift blurring, the performance of the 
multishift algorithm is disappointing when the number of shifts is more than ten |114] . 
which is too few to make the algorithm rich in M-M operations. There are several small 
bulge variants of the QR algorithm [25l[671[73l[80]. But here we discuss the small bulge 
variant of the multishift QR algorithm described in [25]. This algorithm lines up a chain 
of m tightly packed bulges, where each bulge is 2 x 2 as only double implicit shifts are used 
in each bulge to avoid complex arithmetic; that is the number of shifts is 2m. 

Let {si,ti) G C X C, for 1 < i < m, be the m pairs of shifts; either 'Si = ti or {si,ti) G 
M X M. 

The algorithm has three phases. In the first phase, a tightly packed chain of 2 x 2 bulges 
is introduced at the top left corner of the matrix. In the second phase, the bulge chain is 
chased down to the bottom right corner along the diagonal. In the third phase, the chain 
of bulges is chased out of the matrix, restoring it to Hessenberg form. 

We now discuss each phase in detail. 

The first phase deals with the submatrix H[l : 3m +1; 1 : 3m + 1]. A 2 x 2 bulge is 
introduced at its top left corner using (si, ti) and is then chased to its bottom right corner. 
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Figure 6.1: Left:Introdcucing bulge chain from row 1 through s = 3m + 1. Right:Bulge 
chain at the bottom right corner from row r = N — (3m + 1) through N. 

Another bulge is then introduced using (^2,^2) and is then chased until it lies one column 
away from the first bulge. This is continued till m bulges are lined up. This requires the 
application of m(3m — l)/2 OSTs using 3x3 Householders on the submatrix. Next the 
Householders are accumulated into a thickly banded orthogonal matrix and the rest of H 
is updated from both left and right. This is illustrated in the left side of Figure 16. Ij the 
lightly shaded portion in the top left is the submatrix, and the darkly shaded panel to its 
right is the portion updated from the left. 

The second phase has a number of iterations, each of which chases the bulges by k rows 
towards the bottom right corner of H. Suppose at the beginning of an iteration, the bulges 
are in rows r through r + 3m. At the end of the iteration, the bulge chain will span rows 
r + A; to r + 3m + k. The submatrix H[r : r + 3m + fc; r : r + 3m + k], dealt in this iteration, 
is updated using the 3x3 Householders as soon as they are formed. At the end of the 
iteration, the Householders are accumulated into an orthogonal matrix and the rest of the 
matrix is updated from the left and right. This is performed as follows: 
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If f/ G M(3m+fc i)x(3m+fc 1) -g ^Y^e accumulate of the 3x3 Householders generated in the 
iteration, then define U as: 



Apply U to the submatrix H[r : 3m + k; r + 3m + k + 1 : N] from the left 

H[r : r + 3m + k, r + 3m + k + 1 : N] ^ U'^H[r : r + 3m + k, r + 3m + k + 1 : N] 

Similarly apply U to the submatrix H[l : r — 1; r : r + 3m + k] from the right 

H[l : r — 1; r : r + 3m + k] i— H[l : r — 1; r : r + 3m + k]U 

An iteration is illustrated in Figure I6.2( the darkly shaded portions are updated only from 
one side. 

If m << A^, and k << N, then the cost of the M-V operations (to update the submatrix 
H[r : r + 3m + k, r : r + 3m + k] and to accumulate U) is small compared to the cost of 
the M-M operations. 

In the third phase, the chain of bulges is chased off the matrix from their positions is 
H[N — 3m : N] N — 3m : A^]. Then the accumulate of the Householders is applied on the 
the vertical panel [1 : A^ — 3m ~ 1, N — 3m + 1 : A^] from the right. See Figure WA] the 
darkly shaded vertical panel is updated from the right. 

6.2.2.1 I/O and seek complexities 

It has been observed that k ^ 3m minimises the number of flops at 0{mN'^) [25j. If the 
bulge chasing window, i.e. H[r : r + 3m + A;, r : r + 3m + A;], along with matrix U fit in 
the main memory (i.e., m = O(v^M)), then the updates with the OSTs inside the window 
and update of U can all be performed in-core. The portions outside the window that are 
to be updated with U can be read in, updated in-core and written back. So the total I/O 
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Figure 6.2: Bulge chasing phase: bulge chain is chased from rows r through s = r + 3m 
down to rows r + k through s + k. 

cost would be 0{N^/B). The seek complexity of the algorithm would be 0{N^/m). (We 
assume that the matrix is in column major order. To access an 0(m x N) submatrix, 
0{N) seeks are needed. To access an 0{N x m) submatrix, 0{Nm? /M) — 0{N) seeks 
are needed. There are N/m such submatrices to handle.) 

If the bulge chasing window does not fit in the memory, then all the operations, i.e., 
shifting the bulge chain inside the small window, constructing the U matrix explicitly and 
updating the required portions from right and left are performed out-of-core. The V-V 
operations take 0{m?N j B) I/Os and the out-of-core M-M operations take O { jj^^ ■ So 
the total I/Os required by the algorithm is O + . The seek complexity in this 

case would be 0{N'^m/M + Nm^). (6(m^) OSTs are apphed inside the bulge window, 
each of which requires m seeks for a total of 0(m^) seeks per window. The updating 
of U requires fewer seeks. Updating the rest of the matrix costs a matrix multiplication 
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of dimensions 0(m), 0(m), and that incurs Q{m?N / M) seeks. There are Q{N/m) 
windows to process.) 

6.2.3 An improvement using tiles 

We now present a variant of the above algorithm of [25] that uses a tile partitioning of the 
matrix to improve the seek complexity. 

Suppose the input Hessenberg matrix H is tiled as in Algorithm 16.21 

Algorithm 6.2. 'I'ile partitioning for our small bulge multishift QR algorithm 

Input: An N x N Hessenberg matrix H in row/column major order. 
Output: A tile partitioning of H. 

Let w = 3m + 1 and h = {N — w)/{w — 1); 
Hn = H[l:w,l: w]; 
for j = 1 to h do 

= H[l ■.w,{w + l) + (j -l){w-l):{w + l)+ j{w - 1) - 1]; 

endfor; 

for i = 2 to h + 1 do 

Let X = {w + 1) + {i — 2){w — 1) and y = x + w — 2; 
Ha = H[x ■.y,x-3: y]; 
for j = i to h do 

H,j+i = H[x : y, y + (j - t)iw - 1) + 1 : y + (j - t + l)iw - 1)]; 
endfor; 
endfor; 



Here, without loss of generality, we assume that A^ = (3m + 1) + h{3m)] otherwise, the 
last diagonal tile would be of a smaller size. The partitioning has the following properties: 
The first block row is made up of w = 3m + 1 rows; each of the remaining block rows is 
made up of w — 1 rows. The diagonal tile of the first row is made up of the leftmost w 
columns; each of the other tiles of the first row is made up of w — 1 columns. For i > 1, the 
diagonal tile of the i-th block row spans w + 2 columns starting at column 2 + {i — l){w — 1); 
each of the other tiles of the i-th block row spans w — 1 columns. 
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Figure 6.3: Bulge chasing phase: bulge chain is chased from one diagonal tile to the next 
diagonal tile. 

We assume that the matrix has been " tile transposed" according to this partitioning; 
that is, the elements of each tile are stored contiguously. 

The small bulge multishift QR algorithm of |25] is applied as follows: Read Hu into 
the main memory; introduce m bulges in it, and construct the corresponding U all in-core. 
Then, read, update and write back i/i*, the first block row of tiles, one tile at a time. 
A bulge chasing iteration is depicted in Figure 16.31 read the lightly shaded tiles into the 
main memory, chase the bulge chain, and update the lightly shaded tiles using OSTs; also 
construct the corresponding U in the main memory. Now multiply the darkly shaded block 
rows and block columns with U from the left and right respectively. Continue this step 
until the bulge chain touches the bottom row. Then chase the bulge chain out of the matrix 
starting with the bottom-most bulge, construct the corresponding U and multiply the last 
block column with U from the right. This completes one QR iteration. 
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It is easy to see that the amount of data moved remains the same as in [25], namely, 
0{N'^/B) blocks if a bulge chase window fits in the main memory, and O + ^) 
blocks otherwise. 

However, there is an improvement in the seek complexity. In the algorithm of [25], if 
the matrix is stored in column major order, accessing a horizontal slab during the bulge 
chasing phase stage incurs 0{N) seeks each time. In our tile based implementation, if 
we choose m = 0{y/M), a horizontal slab can be read in and written in 0{N/m) seeks 
(as one tile can be read in with a single seek), for a total of 0{N'^ /m^) seeks, which is 
asymptotically the same as the cost of reading, updating and writing H when m = O(v^M) 

On the other hand, if m > \/M, then instead of performing the computations out-of- 
core, we do the following rescheduling. Choose m < m, and introduce the shift pairs in 
batches of m. Chase the bulges of one batch out of the matrix before introducing the next 
batch. Since the bulge chasing window fits in the main memory now, all computations 
are done in-core. Repeat this step for m/m times to complete one QR iteration of the 
algorithm. It can be easily seen that this modification generates the same bulges and the 
same sequence of 3 x 3 Householders as the algorithm of [25] . 

The I/O cost of processing one batch of shifts is 0{N'^/B). As there are m/m batches, 
the total I/O cost is O ^^^^f^j , which is O (^■^=^^ , when m = Q{^/M). Similarly, the seek 
complexity is O (^^^^, which is O (j^^^, when m = Q{\/M). 

Note that here the tile partitioning must be done with m instead of m. 

6.3 The QZ Algorithm 

In this section we consider some variants of the QZ algorithm. 
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6.3.1 The Implicit Multishift QZ Algorithm 

The exphcit or imphcit QZ algorithms [2l[371|58l|69l[nil86l[l09l[Il0l[n5] using one or two 
shifts chase 1 x 1 or 2 x 2 bulge pairs and therefore are rich in V-V operations. A multishift 
QZ algorithm that incorporates more V-M operations using m > 2 shifts is known [69] . 
(See Algorithm 16.31 ) 

The QZ algorithm is an extension of the QR algorithm; each QZ step effectively applies 
one QR step on HT^^, where {H,T) is the input pair. The implicit QZ algorithm avoids 
a costly, and possibly unstable, explicit computation of HT^^. 



Algorithm 6.3. The implicit multishift QZ algorithm I69f 

Input: An N x N properly-Hessenberg-Triangular matrix pair {H,T) and the number of 
shifts m G [2, A^] . 

Output: Orthogonal matrices Q, Z E M^^^ such that {Q^HZ, Q^TZ) is equivalent to 
{H,T), which is overwritten by {Q^HZ,Q^TZ). 

Notation: Let ^j{x) denote a Householder matrix that maps the last N — j elements of 
vector X G to zero without modifying the leading j — 1 elements. 



Determine m shifts pi, . . . , p^; 

X = {HT-^ - pJ){HT-^ - P2I) ■ ■ ■ {HT-^ - p^I)e,; 

Q = :K,{x); {H,T) = {QH,QT); 

Z = 'K,{T-^e,); {H,T) = {HZ,TZ); 

for j = 1 to A^ - 2 do 

g = :K,+i(ife,); iH,T) = {QH,QT); Q = QQ; 

Z = J{,+i(T-Vi)/ {H,T) = {HZ,TZ); Z = ZZ; 
endfor 



There are several ways in which the m shifts can be chosen. For example, they could 
be the generalised eigenvalues of the bottom right m x m matrix pair; these are called 
the generalized Francis shifts. Once the shifts are chosen, the first column a; of 11 = 
{HT^^ — piI){HT^^ — P2I) ■ ■ ■ {HT^^ — pml) can be constructed in 0{m^ / B) I/Os, since 
n has exactly m subdiagonals. 
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Apply !Ki{x) to the matrix pair from the left. That creates an mxm bulge in T. The first 
column of the bulge is chased using an opposite Householder transformation |37 p 69 p 74 pil7] : 

IfT is an invertible matrix, then the first column ofT!Ki{T~^ei) is a scalar multiple ofci. 

Applying the right Householder creates bulge in matrix H from column 1 to m + 1. 
Then left and right Householders are alternatively applied to chase the bulge pair down 
along the subdiagonal, out of the matrix pair, restoring the matrix pair {H, T) to the 
Hessenberg- Triangular form. 

If m > a/M, then the matrix inversions have to be computed out-of-core; this would 
add extra I/Os to the computation. So let us assume that m < a/M. The algorithm takes 
0{mN'^) flops. The I/O complexity of the algorithm is 0{mN'^ / B). The seek complexity 
is 0(A^2) since m < a/M. 

6.3.2 A small-bulge variant of the implicit multishift QZ algo- 
rithm 

In [69], a small bulge variant of the multishift QZ algorithm is proposed, which is a gener- 
alisation of the small bulge variant of the multishift QR algorithm of [22] . This algorithm 
chooses Hg = 2 oi 4, and then lines up a chain of m tightly packed bulges, where each bulge 
is Us X Us] that is the number of shifts is n^m. 

This algorithm too has three phases that are analogous to the three phases of the 
multishift QR algorithm of [23]. However in the QZ algorithm, when the matrix pair 
{H, T) is multiplied from the left with a Householder matrix meant to chase a bulge in 
H, a bulge appears in T, and when that is chased by multiplying the pair from the right, 
a bulge appears in H. Apart from this, the two algorithms are similar, with the same 
asymptotic I/O and seek complexities. So we omit a detailed description and analysis of 
the QZ algorithm. 
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Algorithm 6.4. 'I'ile partitioning for our small bulge multishift QZ algorithm 

Input: A Hessenberg- Triangular matrix pair {H,T) in row/column major order. 
Output: A partitioning of {H , T) . 

Let w = miris + 1) + 1 and h = {N — w)/{w — 1); 
Hii = H[l:w,l: w], Tu =T[l:w,l- w]; 
for j = 1 to h do 

= H[l ■.w,iw + l) + (j - l)iw -l):iw + l)+j{w - 1) - 1]; 
Ti,+i = T[l : ^, + 1) + {j + j{w - 1) - 1]; 
endfor; 

for i = 2 to h + 1 do 

Let X = (w + 1) + (i — 2)(w — 1) and y = x + w — 2; 
Ha = H[x : y, X - {ns + I) : y]; 
Tu = T[x -.y, X- (ris + 1) : y]; 
for j = i to h do 

Hij+, = H[x:y,y + (j - i){w - 1) + 1 : y + {j - t + l){w - 1)]; 

Tij+i = T[x : y, y + {j - i){w - 1) + 1 : y + {j - i + l){w - 1)]; 
endfor; 
endfor; 



6.3.3 An improvement using tiles 

The technique we used to design a tile based small bulge multishift QR algorithm can also 
be extended to this case. Without loss of generahty, assume that = m{ns + 1) + 1 + 
hm{ns + 1); otherwise, the last diagonal tile pair would be of lesser dimension than the rest 
of the diagonal tile pairs. Partition the Hessenberg- Triangular matrix pair as described in 
Algorithm 16. 4[ which is a straight forward generalisation of the partitioning we did for the 
QR algorithm. 

The small bulge multishift QZ algorithm of [69] is applied on the partitioned pair as 
follows: Read (iJii,Tii) into the main memory; introduce m bulge pairs in them, and 
construct the corresponding U and V (resp., the left and right accumulates) all in-core. 
Then, read, update and write back Hu and Tu, the first block rows of tiles, one tile at 
a time. A bulge chasing iteration is depicted in Figure 16.41 read the lightly shaded tile 
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Figure 6.4: Bulge chasing phase: chain of bulge pair is chased from one tile to another tile 
towards bottom right. 

pairs into the main memory, chase the bulge chain, and update the lightly shaded tiles 
using OSTs; also construct the corresponding U and V in the main memory. Now multiply 
the darkly shaded block rows and block columns with U and V from the left and right 
respectively. Continue this step till the bulge chain touches the bottom rows of {H,T). 
Then chase the bulge chain out of the matrix pair starting with the bottom-most bulge 
pair, construct the corresponding V and update the last block column of {H, T) with V 
from the right. This completes one QZ iteration. 

I/O and seek complexities remain same as in the tile based small bulge multishift QR 
algorithm. We omit the details. Note that the case of, number of shift pairs, mus > a/M 
can also be handled similarly, with similar results. 
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We have studied some problems of matrix computations. We analysed several known 
algorithms. We proposed some new algorithms too. The analyses were carried out on the 
external memory model of Aggarwal and Vitter with an intent of estimating the asymptotic 
I/O and seek complexities of the algorithms. 

A seek is the task of positioning the read/write head of the disk on the data block needed 
next. In reality, the cost of seeking a block will depend on the location of the block on the 
disk relative to the block being read or written now. We make the simplistic assumption 
that every block can be sought at the same cost. We define the seek complexity of an 
algorithm as the number of times the algorithm chooses to read or write a block that is 
not physically contiguous with the block being read or written now, thereby necessitating 
a seek. 

The I/O complexity of an algorithm is the sum of the number of blocks of data it 
moves between the two levels of memory, and its seek complexity. In all the algorithms 
we have seen, the former asymptotically dominates the latter for almost all choices of 
problem parameters. Therefore (and also because, new memory technologies (such as flash 
memory) tend to make switches in data localities cheap), in all the algorithms we designed, 
our attempt has been to optimise on the net data movement. 

Note that while translating our complexity bounds into real time on a disk system of 
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today, the data movement complexity is to be scaled using the disk access rate, while 
the seek complexity using the average of "seek time + latency"; the latter scaling factor 
is usually much larger. Also, reductions in the former are not always concomitant with 
reductions in the latter, as some examples in this thesis show. 

For the problems we consider in this thesis a lower bound of il(N^ /\/MB) on blocks of 
data to be moved is known [13]. We demonstrate matching upper bounds for most case. 
In particular, we present a new algorithm that matches the lower bound for Hessenberg 
reduction. 

Two ratios of interest are the data utilisation ratio defined as the number of operations 
performed per element moved, and the data locality ratio defined as the number of blocks 
of data moved per seek, even though neither is a suitable performance metric. 

A matrix multiplication on two txt matrices, for t = 0(-\/M), reads in two txt matrices 
at the cost of 0{t^/B) I/Os, and performs 0{t^) operations on them. With t = 0(a/M), 
the data utilisation ratio is maximised at G(a/M) operations per element moved. When 
t = u{\/M), a matrix multiplication performs 0{t^) operations and 0{t^ / B\/M) I/Os. 
The data utilisation ratio remains the same. The lower bound of on the amount of data to 
be moved for matrix problems [13j, sets an upper bound of 0{\/M) on the data utilisation 
ratios of their algorithms. 

The data locality ratios of matrix transposition and multiplication algorithms discussed 
in Chapter 1 are 0(a/M /B). It is not known whether algorithms of larger ratios exist for 
these problems. We, however, demonstrate algorithms for other matrix problems (for eg., 
QR decomposition) with data locality ratio as high as 0{M/B). 

The I/O performances of slab based algorithms, which are not quite state of the art 
anymore, depend on the the slab widths chosen. To get the best out of these algorithms, 
we find, surprisingly, that it is not always necessary that the slabs fit in the main memory. 
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(See Chapter 2 on QR-decomposition, for example.) Also, for sufficiently large matrices, 
with the matrix multiplications implemented using an optimal blocked OOC algorithm, 
we find, the slab based algorithm for QR decomposition runs with an I/O complexity of 
0{N^/By/M). That means existing implementations of slab based algorithms may not 
have to be completely redesigned (into, for example, the tile based algorithms) for good 
OOC performance. If some of the elementary matrix operations in them (for example, ma- 
trix multiplication) are handled I/O efficiently, asymptotically optimal I/O performances 
might be achievable for most inputs. While this is true for QR-decomposition, it is not for 
Hessenberg reduction, as we have shown. 
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